What's a good Perl regex to untaint an absolute path? - regex

Well, I tried and failed, so here I am again.
I need to match my abs path pattern.
/public_html/mystuff/10000001/001/10/01.cnt
I am in taint mode etc..
#!/usr/bin/perl -Tw
use CGI::Carp qw(fatalsToBrowser);
use strict;
use warnings;
$ENV{PATH} = "bin:/usr/bin";
delete @ENV{qw(IFS CDPATH BASH_ENV ENV)};
I need to open the same file a couple of times or more, and taint mode forces me to untaint the file name every time. Although I may be doing something else wrong, I still need help constructing this pattern for future reference.
my $file = "$var[5]";
if ($file =~ /(\w{1}[\w-\/]*)/) {
$under = "/$1\.cnt";
} else {
ErroR();
}
You can see by my beginner attempt that I am close to clueless.
I had to add the forward slash and extension to $1 due to my poorly constructed, but working, regex.
So, I need help learning how to fix my expression so $1 represents /public_html/mystuff/10000001/001/10/01.cnt
Could someone hold my hand here and show me how to make:
$file =~ /(\w{1}[\w-\/]*)/ match my absolute path /public_html/mystuff/10000001/001/10/01.cnt ?
Thanks for any assistance.

Edit: Using $ in the pattern (as I did before) is not advisable here because it can match \n at the end of the filename. Use \z instead because it unambiguously matches the end of the string.
Be as specific as possible in what you are matching:
my $fn = '/public_html/mystuff/10000001/001/10/01.cnt';
if ( $fn =~ m!
^(
/public_html
/mystuff
/[0-9]{8}
/[0-9]{3}
/[0-9]{2}
/[0-9]{2}\.cnt
)\z!x ) {
print $1, "\n";
}
Alternatively, you can reduce the vertical space taken by the code by putting what I assume to be the common prefix '/public_html/mystuff' in a variable, combining the various components in a qr// construct (see perldoc perlop), and then using the conditional operator ?::
#!/usr/bin/perl
use strict;
use warnings;
my $fn = '/public_html/mystuff/10000001/001/10/01.cnt';
my $prefix = '/public_html/mystuff';
my $re = qr!^($prefix/[0-9]{8}/[0-9]{3}/[0-9]{2}/[0-9]{2}\.cnt)\z!;
$fn = $fn =~ $re ? $1 : undef;
die "Filename did not match the requirements" unless defined $fn;
print $fn, "\n";
Also, I cannot reconcile using a relative path as you do in
$ENV{PATH} = "bin:/usr/bin";
with using taint mode. Did you mean
$ENV{PATH} = "/bin:/usr/bin";

You talk about untainting the file path every time. That's probably because you aren't compartmentalizing your program steps.
In general, I break up these sort of programs into stages. One of the earlier stages is data validation. Before I let the program continue, I validate all the data that I can. If any of it doesn't fit what I expect, I don't let the program continue. I don't want to get half-way through something important (like inserting stuff into a database) only to discover something is wrong.
So, when you get the data, untaint all of it and store the values in a new data structure. Don't use the original data or the CGI functions after that. The CGI module is just there to hand data to your program. After that, the rest of the program should know as little about CGI as possible.
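A minimal sketch of that staging, assuming a CGI parameter named 'file' and reusing the path pattern from the other answer (both the parameter name and the exact pattern are assumptions):
#!/usr/bin/perl -T
use strict;
use warnings;
use CGI;

my $q = CGI->new;

# Stage 1: validate and untaint everything up front.
my %clean;
my $raw = $q->param('file');    # 'file' is a hypothetical parameter name
if ( defined $raw
    && $raw =~ m!^(/public_html/mystuff/[0-9]{8}/[0-9]{3}/[0-9]{2}/[0-9]{2}\.cnt)\z! ) {
    $clean{file} = $1;          # the captured group is untainted
}
else {
    die "Invalid file parameter\n";   # refuse to continue with bad data
}

# Later stages use %clean only; the raw CGI data is never touched again.
open my $fh, '<', $clean{file} or die "Cannot open $clean{file}: $!";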
I don't know what you are doing, but it's almost always a design smell to take actual filenames as input.

Related

Multiple grep (with regex's) functions not working in Perl script

Having trouble with a script right now.
Trying to filter out portions of a file and put them into a scalar. Here is the code.
@value = (grep { m/(III[ABC])/g and m//g } <$fh>);
print @value;
@value = (grep { m/[012]iii/g } (<$fh>));
print @value;
When I run the first grep, the values appear in the print statement. But when I run the second grep, the second print statement doesn't print anything. Does adding a second grep cancel out the effectiveness of the first grep?
I know that the first and second grep work because when I commented out the first grep, the second grep worked.
All I really want to do is filter out information into multiple individual arrays. I am really confused as to how to fix this problem, since I am planning on adding more greps to the script.
The first read on <$fh> gets to the end of the file, so the second invocation has nothing to read. Thus, if you comment out the first one, this doesn't happen and the second one works.
The code below adds to the same array; change to the commented-out code if needed. The regex is simplified, since the original would require a comment while not affecting the actual question. Please put it back the way it was, if that was what you really meant.
You can either rewind the filehandle after all lines have been read
my @vals = grep { /III[ABC]/ } <$fh>;
seek $fh, 0, 0;
# ready for reading again from the beginning
push @vals, grep { /[012]iii/ } <$fh>;
# or: my @vals_2 = grep { /[012]iii/ } <$fh>;
Or you can read all lines into an array that you can then process repeatedly.
my @original = <$fh>;
my @vals = grep { /III[ABC]/ } @original;
push @vals, grep { m/[012]iii/ } @original;
# or assign to a different array
If you don't need to store these results in such order it would be far more efficient to read the file line by line, and process and add as you go.
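For instance, a minimal single-pass sketch (the array names are placeholders):
# One pass over the file; no rewind or intermediate full-file array needed.
my (@abc_lines, @iii_lines);
while ( my $line = <$fh> ) {
    push @abc_lines, $line if $line =~ /III[ABC]/;
    push @iii_lines, $line if $line =~ /[012]iii/;
}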
Update
I simplified the originally posted regex in order to focus on the question at hand, since the exact condition inside the block has no bearing on it. See the Note below. Thanks to ikegami for bringing it up and for explaining that // "repeats the last successful query".
The m//g is tricky and I removed it.
grep checks a condition and passes a line through if the condition evaluates true. In such scalar context the /.../g modifier has effects which are a very different story, so it was removed.
For the same reason as above, the capturing () is unneeded (excessive).
Cleaning up the syntax helps readability here, so the m/ prefix was dropped as well.
Note on regex
In scalar context the /.../g modifier does the following, per perlrequick:
successive matches against a string will have //g jump from match to match
The empty string pattern m//g also has effects which are far from obvious, as stated above.
Taken together these produce non-trivial results in my tests and need mental tracing to understand. I removed them from the code here since leaving them begs a question on whether they are intended trickery or subtle bugs, thus distracting from the actual question -- which they do not affect at all.
I don't know what you think the g modifier does, but it makes no sense here.
I don't know what you think // (a match with an empty pattern) does, but it makes no sense here.
In list context, <$fh> returns all remaining lines in the file. It returns nothing the second time you evaluate it since you've already reached the end of the file the first time you evaluated it.
Fix:
my @lines = <$fh>;
my @values1 = grep { /III[ABC]/ && /.../ } @lines;
my @values2 = grep { /[012]iii/ } @lines;
Of course, substitute ... for what you meant to use there.

What is the quickest way to determine if a file contains a specific string?

I'm new to Perl and am working on a problem: replacing a string in some files with another one. The only way I know is like the following:
#!/usr/bin/perl
$file = "default.properties";
open (IN, $file) || die "Cannot open file ".$file." for read";
@lines=<IN>;
close IN;
open (OUT, ">", $file) || die "Cannot open file ".$file." for write";
foreach $line (@lines)
{
$line =~ s/hello/hello hello hello/ig;
print OUT $line;
}
close OUT;
this treats each file equally and scans the lines of each file one by one; it would waste a lot of time if the file does not contain the string I want to replace. I'm wondering if there's a way (like a hash) to determine whether a file contains a specific string?
P.S. Is there a quicker way for string replacement in a file instead of scanning its lines sequentially to find the lines that match and then replacing them?
I'm new to Perl
This has nothing to do with your immediate question, but you should get a good book on Modern Perl.
Perl has greatly changed over the years, and the way you write in Perl has changed. Since you're just starting out, you might as well do it right. Looking at your code, it seems you're picking up the coding style from the older releases of Perl.
Now to your question:
this treats each file equally and scans the lines of each file one by one; it would waste a lot of time if the file does not contain the string I want to replace. I'm wondering if there's a way (like a hash) to determine whether a file contains a specific string?
In the end, you have to read the entire file. There's no simple way around that. Yes, you could make your code shorter, but a read operation reads a file bit-by-bit and the substitute substitutes on the file bit-by-bit. Shorter code doesn't necessarily mean it's more efficient.
Here's your program written in a more modern style.
#! /usr/bin/env perl
use strict;
use warnings;
use autodie; # Automatically kills your program on file errors
use feature qw(say); # Automatically adds the \n on the end.
use File::Copy; # Gives me the "move" command
my $file = "default.properties";
open my $in_fh, "<", $file;
open my $out_fh, ">", "$file.temp"; #Can't open a file for reading and writing at the same time!
while ( my $line = <$in_fh> ) {
chomp $line; # I always recommend that you chomp when you read.
$line =~ s/hello/hello hello hello/;
say {$out_fh} $line;
}
close $in_fh;
close $out_fh;
move "$file.temp", $file;
As you can see, this is still processing a line at a time.
Here are some of the items in the above:
use strict; - Requires you to declare variables before using them
use warnings; - Prints out all sorts of warnings, such as using undefined variables
use autodie; - Automatically kills your program when a file operation fails. This can save you a lot of grief if you forget to check whether or not something worked.
use feature qw(say); - Implements the "say" command. This is like print, but automatically adds a newline on the end.
use File::Copy; - Gives you the move command. You can't easily read and write to the same file. Therefore, I had to use a different file name for input and output. Better would be File::Temp, which allows you to define temporary files that are guaranteed to be unique; a short sketch follows this list.
open - Use scalar variables for file handles. It makes it easier to pass a file handle to a function.
while - A for loop has to read in the entire file before processing. A while loop reads in a file line-by-line. Always use a while when reading in a file in a loop.
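Here is that File::Temp sketch; it reuses $file and $in_fh from the script above, and the template name is an arbitrary choice (treat this as an outline, not a drop-in replacement):
use File::Temp qw(tempfile);
use File::Copy qw(move);

# Ask File::Temp for a unique file in the current directory, and keep it
# around after the handle closes so it can be moved over the original.
my ($tmp_fh, $tmp_name) = tempfile( 'propertiesXXXX', DIR => '.', UNLINK => 0 );

while ( my $line = <$in_fh> ) {
    $line =~ s/hello/hello hello hello/;
    print {$tmp_fh} $line;
}
close $tmp_fh;

# Replace the original only after the rewrite completed successfully.
move $tmp_name, $file;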
You can eliminate the loop, but it doesn't mean the code is that much more efficient:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie; # Automatically kills your program on file errors
use feature qw(say);
my $file = "default.properties";
open my $in_fh, "<", $file;
open my $out_fh, ">", "$file.temp";
my @lines = <$in_fh>; #Read in all the lines at once
map { s/hello/hello hello hello/; } @lines;
say {$out_fh} join "", @lines;
close $in_fh;
close $out_fh;
move "$file.temp", $file;
This is using map, which is a way of operating on an array without an explicit loop. It's a tricky command to understand, but it acts as a loop on the array you're passing to it. This is changing each entry in @lines with the substitute command enclosed in the curly braces. You'll see this a lot in Perl, and it can be cleaner in many cases than a for loop.
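As a side note on that design choice: inside the map block, $_ is aliased to each element, so the substitution edits @lines in place. A postfix for does the same job and avoids using map in void context:
# Same in-place edit as the map above; s/// aliases $_ to each element.
s/hello/hello hello hello/ for @lines;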
Finally, you could put the entire file into a single scalar variable (including newlines) and do a substitution on that:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie; # Automatically kills your program on file errors
use feature qw(say);
my $file = "default.properties";
open my $in_fh, "<", $file;
open my $out_fh, ">", "$file.temp";
my @lines = <$in_fh>; #Read in all the lines at once
my $lines = join "", @lines; # Converts file to one long scalar variable
$lines =~ s/hello/hello hello hello/g;
say {$out_fh} $lines;
close $in_fh;
close $out_fh;
move "$file.temp", $file;
Is this more efficient? I doubt it. Regular expressions are not very efficient operations, and running one over a multi-line, very long scalar variable isn't going to be efficient.
True efficiency is a readable, maintainable program. You probably will spend a lot more time on maintenance than the length of time the program actually runs. This last example is harder to understand and probably more difficult to modify. You're better off sticking with either map or the while loop.
No, there is no magic way to know if a file contains a string in advance.
I'd advise doing line-by-line processing instead of slurping the entire file.
You can do this using perl's $INPLACE_EDIT to edit a file as demonstrated below, or check out one of the many other methods listed in perlfaq5 - How do I change, delete, or insert a line in a file, or append to the beginning of a file.
#!/usr/bin/perl
use strict;
use warnings;
my $file = "default.properties";
local @ARGV = $file;
local $^I = '.bak';
while (<>) {
s/hello/hello hello hello/ig;
print;
}
unlink "$file$^I"; # Delete backup
Or the equivalent in a one-liner
perl -i -pe 's/hello/hello hello hello/ig;' default.properties
I'm wondering if there's a way (like a hash) to determine whether a file contains a specific string?
Not really, no.
Is there a quicker way for string replacement in a file instead of scanning its lines sequentially to find the lines that match and then replacing them?
Also no.
That said, your perl script might not be as fast or optimized as some other options; for your case, most notably sed(1):
sed -i -e 's/hello/hello hello hello/g' default.properties

Extract unique lines from files (with a pattern) recursively from directory/subdirectories

I have a huge java codebase (more than 10,000 java classes) that makes extensive use of CORBA (no documentation available on its usage though).
As first step to figure out the CORBA usage, I decided to scan entire codebase and extract/print unique lines which contain the pattern "org.omg.CORBA". These are usually in the import statements (e.g. import org.omg.CORBA.x.y.z).
I am a newbie to Perl and want to know if there is a way I can extract these details on Windows. I need to be able to scan all folders (and sub-folders) that have Java classes.
You can use File::Find in a one-liner:
perl -MFile::Find -lwe "
find(sub { if (-f && /\.java$/) { push @ARGV, $File::Find::name } }, '.');
while(<>) { /org.omg.CORBA/ && $seen{$_}++; };
print for keys %seen;"
Note that this one-liner is using the double quotes required for Windows.
This will search the current directory recursively for files with extension .java and add them to the @ARGV array. Then we use the diamond operator to open the files and search for the string org.omg.CORBA, and if it is found, that line is added as a key to the %seen hash, which will effectively remove duplicates. The last statement prints out all the unique keys in the hash.
In script form it looks like this:
use strict;
use warnings;
use File::Find;
find(sub { if (-f && /\.java$/) { push @ARGV, $File::Find::name } }, '.');
my %seen;
while(<>) {
chomp; # the -l switch handled this in the one-liner
/org.omg.CORBA/ && $seen{$_}++;
}
print "$_\n" for keys %seen;
Just for fun, a perl one-liner to do this:
perl -lne '/org.omg.CORBA/ and (++$seen{$_}>1 or print)' *
This first checks if a line matches and then if it has not seen it before prints out the line. That is done for all files specified (in this case '*').
I don't mean to be contrarian, but I'm not sure Perl is the best solution here. nhahtdh's suggestion of using Cygwin is a good one. grep or find is really what you want. Using Perl in this instance will involve using File::Find and then opening a filehandle on every file. That's certainly doable, but, if possible, I'd suggest using the right tool for the job.
find . -name "*.java" -type f | xargs grep -l 'org.com.CORBA' | sort | uniq
If you really must use Perl for this job, we can work up the File::Find code; a rough sketch follows.
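Something along these lines might serve as a starting point (an untested sketch; as noted above, it opens a filehandle on every file):
use strict;
use warnings;
use File::Find;

my %seen;
find(
    sub {
        return unless -f && /\.java$/;
        open my $fh, '<', $_ or return;    # one filehandle per file
        while ( my $line = <$fh> ) {
            $seen{$line}++ if $line =~ /org\.omg\.CORBA/;
        }
        close $fh;
    },
    '.'
);
print for sort keys %seen;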

How to regex search/replace with File::Map in a largish text file to avoid an "Out of Memory" error?

UPDATE 2: Solved. See below.
I am in the process of converting a big txt-file from an old DOS-based library program into a more usable format. I just started out in Perl and managed to put together a script such as this one:
BEGIN {undef $/; };
open $in, '<', "orig.txt" or die "Can't read old file: $!";
open $out, '>', "mod.txt" or die "Can't write new file: $!";
while( <$in> )
{
$C=s/foo/bar/gm;
print "$C matches replaced.\n"
etc...
print $out $_;
}
close $out;
It is quite fast, but after some time I always get an "Out of Memory" error due to lack of RAM/swap space (I'm on Win XP with 2GB of RAM and a 1.5GB swap file).
After having looked around a bit on how to deal with big files, File::Map seemed to me as a good way to avoid this problem. I'm having trouble implementing it, though.
This is what I have for now:
#!perl -w
use strict;
use warnings;
use File::Map qw(map_file);
my $out = 'output.txt';
map_file my $map, 'input.txt', '<';
$map =~ s/foo/bar/gm;
print $out $map;
However I get the following error: Modification of a read-only value attempted at gott.pl line 8.
Also, I read on the File::Map help page that on non-Unix systems I need to use binmode. How do I do that?
Basically, what I want to do is to "load" the file via File::Map and then run code like the following:
$C=s/foo/bar/gm;
print "$C matches found and replaced.\n";
$C=s/goo/far/gm;
print "$C matches found and replaced.\n";
while(m/complex_condition/gm)
{
$C=s/complex/regex/gm;
$run_counter++;
}
print "$C matches replaced. Script looped $run_counter times.\n";
etc...
I hope that I didn't overlook something too obvious, but the example given on the File::Map help page only shows how to read from a mapped file, correct?
EDIT:
In order to better illustrate what I currently can't accomplish due to running out of memory, I'll give you an example:
On http://pastebin.com/6Ehnx6xA is a sample of one of our exported library records (txt-format). I'm interested in the +Deskriptoren: part starting on line 46. These are thematic classifiers which are organised in a tree hierarchy.
What I want is to expand each classifier with its complete chain of parent nodes, but only if none of the parent nodes are already present before or after the child node in question. This means turning
+Deskriptoren
-foo
-Cultural Revolution
-bar
into
+Deskriptoren
-foo
-History
-Modern History
-PRC
-Cultural Revolution
-bar
The currently used regex makes use of lookbehind and lookahead in order to avoid duplicates and is thus slightly more complicated than s/foo/bar/g;:
s/(?<=\+Deskriptoren:\n)((?:-(?!\QParent-Node\E).+\n)*)-(Child-Node_1|Child-Node_2|...|Child-Node_11)\n((?:-(?!Parent-Node).+\n)*)/${1}-Parent-Node\n-${2}\n${3}/g;
But it works! Until Perl runs out of memory that is... :/
So in essence I need a way to do manipulations on a large file (80MB) over several lines. Processing time is not an issue. This is why I thought of File::Map.
Another option could be to process the file in several steps with linked perl-scripts calling each other and then terminating, but I'd like to keep it as much in one place as possible.
UPDATE 2:
I managed to get it working with Schwern's code below. My script now calls the following subroutine, which calls two nested subroutines. Example code is at: http://pastebin.com/SQd2f8ZZ
Still not quite satisfied that I couldn't get File::Map to work. Oh well... I guess that the line-approach is more efficient anyway.
Thanks everyone!
When you set $/ (the input record separator) to undefined, you are "slurping" the file -- reading the entire content of the file at once (this is discussed in perlvar, for example). Hence the out-of-memory problem.
Instead, process it one line at a time, if you can:
while (my $line = <$in>){
# Do stuff.
}
In situations where the file is small enough and you do slurp the file, there is no need for the while loop. The first read gets everything:
{
local $/ = undef;
my $file_content = <>;
# Do stuff with the complete file.
}
Update
After seeing your massive regex, I would urge you to reconsider your strategy. Tackle this as a parsing problem: process the file one line at a time, storing information about the parser's state as needed. This approach allows you to work with the information using simple, easily understood (even testable) steps.
Your current strategy -- one might call it the slurp and whack with massive regex strategy -- is difficult to understand and maintain (in 3 months, will your regex make immediate sense to you?), difficult to test and debug, and difficult to adjust if you discover unanticipated deviations from your initial understanding of the data. In addition, as you've discovered, the strategy is vulnerable to memory limitations (because of the need to slurp the file).
There are many questions on StackOverflow illustrating how one can parse text when the meaningful units span multiple lines. Also see this question, where I delivered similar advice to another questioner.
Some simple parsing can break the file down into manageable chunks. The algorithm is:
1. Read until you see `+Deskriptoren:`
2. Read everything after that until the next `+Foo:` line
3. Munge that bit.
4. Goto 1.
Here's the sketch of the code:
use strict;
use warnings;
use autodie;
open my $in, '<', $input_file;
open my $out, '>', $output_file;
while(my $line = <$in>) {
# Print out everything you don't modify
# this includes the +Deskriptoren line.
print $out $line;
# When the start of a description block is seen, slurp in up to
# the next section.
if( $line =~ m{^ \+Deskriptoren: }x ) {
my($section, $next_line) = _read_to_next_section($in);
# Print the modified description
print $out _munge_description($section);
# And the following header line.
print $out $next_line;
}
}
sub _read_to_next_section {
my $in = shift;
my $section = '';
my $line;
while( $line = <$in> ) {
last if $line =~ /^ \+ /x;
$section .= $line;
}
# When reading the last section, there might not be a next line
# resulting in $line being undefined.
$line = '' if !defined $line;
return($section, $line);
}
# Note, the +Deskriptoren line is not on $description
sub _munge_description {
my $description = shift;
...whatever you want to do to the description...
return $description;
}
I haven't tested it, but something like that should do you. It has the advantage over dealing with the whole file as a string (File::Map or otherwise) that you can deal with each section individually rather than trying to cover every base in one regex. It also will let you develop a more sophisticated parser to deal with things like comments and strings that might mess up the simple parsing above and would be a huge pain to adapt a massive regex to.
You are using mode <, which is read-only. If you want to modify the contents, you need read-write access, so you should be using +<.
If you are on windows, and need binary mode, then you should open the file separately, set binary mode on the file handle, then map from that handle.
I also noticed that you have an input file and an output file. If you use File::Map, you are changing the file in-place... that is, you can't open the file for reading and change the contents of a different file. You would need to copy the file, then modify the copy. I've done so below.
use strict;
use warnings;
use File::Map qw(map_handle unmap);
use File::Copy;
copy("input.txt", "output.txt") or die "Cannot copy input.txt to output.txt: $!\n";
open my $fh, '+<', "output.txt"
or die "Cannot open output.txt in r/w mode: $!\n";
binmode($fh);
map_handle my $contents, $fh, '+<';
my $n_changes = ( $contents =~ s/from/to/gm );
unmap($contents);
close($fh);
The documentation for File::Map isn't very good on how errors are signaled, but from the source, it looks as if $contents being undefined would be a good guess.
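If that guess holds, a defensive check might look like the following; note that the undef-on-failure behavior is an assumption, not a documented API:
map_handle my $contents, $fh, '+<';
die "map_handle failed: $!\n" if !defined $contents;   # assumes undef signals failure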

Perl splitting text string (from HTML page, text document, etc.) by line into array?

This is kind of a weird question, at least for me, as I don't exactly understand what is fully involved in it. Basically, I have been using a process where I save a scraped document (such as a web page) to a .txt file. Then I can easily use Perl to read this file and put each line into an array. However, it is not doing this based on any visible thing in the document (i.e., it is not going by HTML line breaks); it just knows where a new line is, based on the .txt format.
However, I would like to cut this process out and just do the same thing from within a variable. Instead, I would have what would have been the contents of the .txt file in a string, and then I want to parse it, in the same way, line by line. The problem for me is that I don't know much about how this would work, as I don't really understand how Perl would be able to tell where a new line is (assuming I'm not going by HTML line breaks, as often it is just a web-based .txt file, which presents to my scraper, WWW::Mechanize, as a web page, so there is no HTML to go by). I figure I can do this using other parameters, such as blank spaces, but am interested to know if there is a way to do this by line. Any info is appreciated.
I'd like to cut out the actual saving of a file to reduce issues related to permissions on the servers I use, and I am also just curious whether I can make the process more efficient.
Here's an idea that might help you: you can open from strings as well as files.
So if you used to do this:
open( my $io, '<', 'blah.txt' ) or die "Could not open blah.txt! - $!";
my #list = <$io>;
You can just do this:
open( my $io, '<', \$text_I_captured );
my #list = <$io>;
It's hard to tell what your code's doing since we don't have it in front of us; it would be easier to help if you posted what you had. However, I'll give it a shot. If you scrape the text into a variable, you will have a string which may have embedded line breaks. These will either be \n (the traditional Unix newline) or \r\n (the traditional Windows newline sequence). Just as you can split on a space to get (a first approximation of) the words in a sentence, you can instead split on the newline sequence to get the lines. Thus, the single line you'll need should be
my #lines = split(/\r?\n/, $scraped_text);
Use the $/ variable; it determines what to break lines on. So:
local $/ = " ";
while(<FILE>)...
would give you chunks separated by spaces. Just set it back to "\n" to get back to the way it was - or better yet, go out of the local $/ scope and let the global one come back, just in case it was something other than "\n" to begin with.
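For instance, a small sketch of the scoping idea (the file name is a placeholder):
open FILE, "<", "somefile.txt" or die "Cannot open somefile.txt: $!";
{
    local $/ = " ";    # space-separated chunks inside this block only
    while (<FILE>) { print "chunk='$_'\n"; }
}   # leaving the block restores whatever $/ was before
close FILE;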
You can eliminate it altogether:
local $/ = undef;
to read whole files in one slurp, and then iterate through them however you like. Just be aware that if you do a split or a splice, you may end up copying the string over and over, using lots of CPU and lots of memory. One way to do it with less is:
# perl -de 0
> $_="foo\nbar\nbaz\n";
> while( /\G([^\n]*)\n/go ) { print "line='$1'\n"; }
line='foo'
line='bar'
line='baz'
That's if you're breaking things apart by newlines, for example. \G matches either the beginning of the string or the end of the last match, within a /g-tagged regex.
Another weird tidbit is $/=\10... if you give it a scalar reference to an integer (here 10), you can get record-length chunks:
# cat fff
eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun
# perl -de 0
$/ = \10;
open FILE, "<fff";
while(<FILE>){ print "chunk='$_'\n"; }
chunk='eurgpuwerg'
chunk='piuewrngpi'
chunk='euwngipuen'
chunk='rgpiunergp'
chunk='iunerpigun'
chunk='
'
More info: http://www.perl.com/pub/a/2004/06/18/variables.html
If you combine this with FM's answer of using:
$data = "eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun";
open STRING, "<", \$data;
while(<STRING>){ print "chunk='$_'\n"; }
I think you can get every combination of what you need...