Using foreach instead of map and grep in perl - regex

my @a_columns = map { s/^"|"$|\n|\r|\n\r|"//g; $_ } split /;/, $s_act_line;
The above is my code. I am getting an automatic warning when merging it because of the use of the map function.
Please help me convert this to use for, foreach, or any other loop.
I have already tried a couple of approaches, but nothing has worked.
Please help.

The warning appears to follow Perl::Critic (even though it seems to be issued by your IDE).
The variable $_ in map, but also in grep and foreach, is an alias for the currently processed array element. So once it gets changed the input array gets changed! This is usually unwanted, and can surely be considered a tricky practice in general, thus the warning.
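To make the aliasing concrete, here is a minimal sketch. The array @fields and its contents are made up for illustration; the point is only that editing $_ inside map edits the input array.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the question.
my @fields = ('"alpha"', '"beta"');

# $_ aliases each element of @fields, so s/// edits @fields itself.
my @cleaned = map { s/"//g; $_ } @fields;

print "cleaned: @cleaned\n";   # cleaned: alpha beta
print "fields:  @fields\n";    # fields:  alpha beta (the input changed too)
```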
But in this case the input for map is the list produced by split, applied to your variable. So the code in map's block cannot change your variables by changing $_. It is then safe to tell Perl::Critic to ignore this statement:
my @a_columns =
map { s/^"|"$|\n|\r|\n\r|"//g; $_ } split /;/, $s_act_line; ## no critic
or for more than one statement
## no critic
... code that Perl::Critic should ignore
## use critic
If the warning is indeed produced by Perl::Critic (and not by IDE) this should stop it.
Better yet, the idiom used in map's block is unneeded with Perl versions from 5.14, since we got the non-destructive substitution. With it s///r returns the changed string, leaving the original unchanged. So you can do
my @a_columns = map { s/^"|"$|\n|\r|\n\r|"//gr } split /;/, $s_act_line;
which is also much cleaner and safer. Now Perl::Critic should not flag this.

The problem is that you're modifying $_ in the map, which means you're modifying the values passed to map. That can lead to surprises. Switching to for wouldn't help.
It's actually harmless in this case since split returns temporary values, but the following are "clean" alternatives:
my @a_columns =
map { ( my $tmp = $_ ) =~ s/^"|"$|\n|\r|\n\r|"//g; $tmp }
split /;/, $s_act_line;
use List::MoreUtils qw( apply );
my @a_columns =
apply { s/^"|"$|\n|\r|\n\r|"//g; }
split /;/, $s_act_line;
my @a_columns =
map { s/^"|"$|\n|\r|\n\r|"//rg }
split /;/, $s_act_line;
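Since the question asked for a loop, here is one possible sketch of a foreach version. The sample value of $s_act_line is invented for illustration. The character class is my own simplification: because the original alternation already removes every double quote, \n and \r anywhere in the string, it should be equivalent to ["\r\n], but keep the original pattern if you prefer.

```perl
use strict;
use warnings;

# Hypothetical input line; the real one comes from the asker's file.
my $s_act_line = qq{"a";"b c";d\r\n};

# split returns fresh values, so the loop only modifies our own array.
my @a_columns = split /;/, $s_act_line;

for my $column (@a_columns) {
    # Remove every double quote, carriage return and newline.
    $column =~ s/["\r\n]//g;
}

print join('|', @a_columns), "\n";   # a|b c|d
```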

Multiple grep (with regex's) functions not working in Perl script

Having trouble with a script right now.
Trying to filter out portions of a file and put them into a scalar. Here is the code.
@value = (grep {m/(III[ABC])/g and m//g }<$fh>);
print @value;
@value = (grep { m/[012]iii/g}(<$fh>));
print @value;
When I run the first grep, the values appear in the print statement. But when I run the second grep, the second print statement doesn't print anything. Does adding a second grep cancel out the effectiveness of the first grep?
I know that the first and second grep both work, because when I commented out the first grep, the second grep worked.
All I really want to do, is filter out information, for multiple different individual arrays. I am really confused as to how to fix this problem, since I am planning on adding more grep's to the script.
The first read on <$fh> gets to the end of the file. Then the second invocation has nothing to read. Thus if you comment out the first one this doesn't happen and the second one works.
The code below adds to the same array. Change to the commented out code if needed. The regex is simplified, since it requires a comment while it doesn't affect the actual question. Please put it back the way it was, if that was what you really meant.
You can either rewind the filehandle after all lines have been read
my @vals = grep { /III[ABC]/ } <$fh>;
seek $fh, 0, 0;
# ready for reading again from the beginning
push @vals, grep { /[012]iii/ } <$fh>;
# or: my @vals_2 = grep { /[012]iii/ } <$fh>;
Or you can read all lines into an array that you can then process repeatedly.
my @original = <$fh>;
my @vals = grep { /III[ABC]/ } @original;
push @vals, grep { m/[012]iii/ } @original;
# or assign to a different array
If you don't need to store these results in such order it would be far more efficient to read the file line by line, and process and add as you go.
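A minimal sketch of that line-by-line approach follows. The in-memory file contents and the array names @upper and @lower are assumptions for illustration; with a real file you would open a path instead of a scalar reference.

```perl
use strict;
use warnings;

# Hypothetical data standing in for the file; opened as an in-memory filehandle.
my $data = "x IIIA\ny 0iii\nz none\nw IIIC 2iii\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my (@upper, @lower);

# One pass over the file: each line is tested against both patterns.
while (my $line = <$fh>) {
    push @upper, $line if $line =~ /III[ABC]/;
    push @lower, $line if $line =~ /[012]iii/;
}
close $fh;

print "upper: ", scalar(@upper), ", lower: ", scalar(@lower), "\n";   # upper: 2, lower: 2
```

This reads the file once and never holds more than one line plus the matches in memory, at the cost of interleaving the two filters in a single loop.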
Update
I simplified the originally posted regex in order to focus on the question at hand, since the exact condition inside the block has no bearing on it. See the Note below. Thanks to ikegami for bringing it up and for explaining that // "repeats the last successful query".
The m//g is tricky, so I removed it.
grep checks a condition and passes a line through if the condition evaluates as true. In that scalar context the /g modifier has effects that are a very different story, so I removed it.
For the same reason, the capturing () is unneeded.
Cleaning up the syntax helps readability here, so I dropped the leading m in m/.../.
Note on regex
In scalar context /.../g modifier does the following, per perlrequick:
successive matches against a string will have //g jump from match to match
The empty string pattern m//g also has effects which are far from obvious, stated above.
Taken together these produce non-trivial results in my tests and need mental tracing to understand. I removed them from the code here since leaving them begs a question on whether they are intended trickery or subtle bugs, thus distracting from the actual question -- which they do not affect at all.
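As a rough illustration of both effects, here is a small sketch. The sample strings are made up, and the variable names are only for this example; it shows /g advancing pos() between scalar-context matches and an empty pattern reusing the last pattern that matched successfully.

```perl
use strict;
use warnings;

my $line = "IIIA then IIIB";   # hypothetical sample line

# In scalar context, each /g match resumes where the previous one ended.
my @positions;
while ($line =~ /III[ABC]/g) {
    push @positions, pos($line);
}
print "positions: @positions\n";   # positions: 4 14

# An empty pattern reuses the last successfully matched pattern, here /y/.
'xyz' =~ /y/ or die "setup match failed";
my $empty_on_abc = ('abc' =~ m//) ? 1 : 0;   # 0: "abc" contains no "y"
my $empty_on_hay = ('hay' =~ m//) ? 1 : 0;   # 1: "hay" contains a "y"
print "abc: $empty_on_abc, hay: $empty_on_hay\n";
```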
I don't know what you think the g modifier does, but it makes no sense here.
I don't know what you think // (a match with an empty pattern) does, but it makes no sense here.
In list context, <$fh> returns all remaining lines in the file. It returns nothing the second time you evaluate it, since you already reached the end of the file the first time you evaluated it.
Fix:
my @lines = <$fh>;
my @values1 = grep { /III[ABC]/ && /.../ } @lines;
my @values2 = grep { /[012]iii/ } @lines;
Of course, replace ... with what you meant to use there.

Tcl regexp cache with lists of RE

I read that Tcl caches the last 30 regexp compiled and also that assigning a variable to the RE in string version will make Tcl attach the compiled RE to the variable the first time it is used. But what I can't seem to find is if that compiled RE caching will still be done if the RE are contained in a list and iterated upon.
Basically, imagine I have this :
set REs {
"RE 1"
"RE 2"
.
.
.
"RE 39"
"RE 40"
}
foreach re $REs {
if { [regexp -nocase $re $line] } {
AchieveWorldPeace $line
}
}
Since those REs are used over and over and since I have more than 30 REs (and I don't want to recompile Tcl after changing the corresponding #define based solely on that script), the caching becomes important for the script to run at its fastest. My question is therefore : in this example, would the regular expression be recompiled at each loop? If yes, is there a way to ensure caching when using lists of regular expressions?
Basically, is there a way for the caching to be attached to the Tcl_Object pointed to by the list and not to the Tcl_Object pointed to by the iterator in the foreach ? (Note : that question might be wrong on multiple levels because I don't have any experience in terms of Tcl source code, but it's how I imagined the whole thing to be implemented.)
Please note that this question is more oriented on a better understanding of Tcl than on a specific code answer.
Also, I know I can do something like this :
set RE "(RE 1|RE 2| ... |RE 39|RE 40)"
if { [regexp -nocase $RE $line] } {
AchieveWorldPeace $line
}
And, from my tests, I know that this speeds up my script by about a factor of two (which is not bad considering the script does a lot more). However, there is no way to tell easily which RE was matched when implemented this way, so it's not quite the same. (Not critical in my case, but just saying...)
Tcl uses two caches of RE compilations. One is the per-thread cache, and the other is in the Tcl_Obj internal representation of the RE. Since the values in a list retain their internal representations, the foreach of a list will keep them as well: your example code will be perfectly well cached with no need for further special action by you. Easy!

TCL: Backslash issue (regsub)

I have an issue while trying to read a member of a list like \\server\directory
The issue comes when I try to get this variable using the lindex command, which performs TCL substitution, so the result is:
\serverdirectory
Then, I think I need to use a regsub command to avoid the backslash substitution, but I could not work out the correct procedure.
An example of what I want should be:
set mistring "\\server\directory"
regsub [appropriate regular expression here]
puts "mistring: '$mistring'" ==> "mistring: '\\server\directory'"
I have checked some posts about this, and keeping the \\ is OK, but I still have problems when trying to always keep a single \ followed by any other character that could appear here.
UPDATE: specific example. What I am actually trying to keep is the initial format of an element in a list. The list is received by an outer application. The original code is something like this:
set mytable $__outer_list_received
puts "Table: '$mytable'"
for { set i 0 } { $i < [llength $mytable] } { incr i } {
set row [lindex $mytable $i]
puts "Row: '$row'"
set elements [lindex $row 0]
puts "Elements: '$elements'"
}
The output of this, in this case is:
Table: '{{
address \\server\directory
filename foo.bar
}}'
Row: '{
address \\server\directory
filename foo.bar
}'
Elements: '
address \\server\directory
filename foo.bar
'
So I try to get the value of address (in this specific case, \\server\directory) in order to write it in a configuration file, keeping the original format and data.
I hope this clarifies the problem.
If you don't want substitutions, put the problematic string inside curly braces.
% puts "\\server\directory"
\serverdirectory
and it's not what you want. But
% puts {\\server\directory}
\\server\directory
as you need.
Since this is fundamentally a problem on Windows (and Tcl always treats backslashes in double-quotes as instructions to perform escaping substitutions) you should consider a different approach (otherwise you've got the problem that the backslashes are gone by the time you can apply code to “fix” them). Luckily, you've got two alternatives. The first is to put the string in {braces} to disable substitutions, just like a C# verbatim string literal (but that uses @"this" instead). The second is perhaps more suitable:
set mistring [file nativename "//server/directory"]
That ensures that the platform native directory separator is used on Windows (and nowadays does nothing on other platforms; back when old MacOS9 was supported it was much more magical). Normally, you only need this sort of thing if you are displaying full pathnames to users (usually a bad idea, GUI-wise) or if you are passing the name to some API that doesn't like forward slashes (notably when going as an argument to a program via exec but there are other places where the details leak through, such as if you're using the dde, tcom or twapi packages).
A third, although ugly, option is to double the backslashes: \\\\ instead of \\, and \\ instead of \, while using double quotes. When the substitution occurs it should give you what you want. Of course, this will not help much if the substitution happens a second time.

How do I check if a scalar has a compiled regex in it with Perl?

Let's say I have a subroutine/method that a user can call to test some data that (as an example) might look like this:
sub test_output {
my ($self, $test) = @_;
my $output = $self->long_process_to_get_data();
if ($output =~ /\Q$test/) {
$self->assert_something();
}
else {
$self->do_something_else();
}
}
Normally, $test is a string, which we're looking for anywhere in the output. This was an interface put together to make calling it very easy. However, we've found that sometimes, a straight string is problematic - for example, a large, possibly varying number of spaces...a pattern, if you will. Thus, I'd like to let them pass in a regex as an option. I could just do:
$output =~ $test
if I could assume that it's always a regex, but ah, but the backwards compatibility! If they pass in a string, it still needs to test it like a raw string.
So in that case, I'll need to test to see if $test is a regex. Is there any good facility for detecting whether or not a scalar has a compiled regex in it?
As hobbs points out, if you're sure that you'll be on 5.10 or later, you can use the built-in check:
use 5.010;
use re qw(is_regexp);
if (is_regexp($pattern)) {
say "It's a regex";
} else {
say "Not a regex";
}
However, I don't always have that option. In general, I do this by checking against a prototype value with ref:
if( ref $scalar eq ref qr// ) { ... }
One of the reasons I started doing it this way was that I could never remember the type name for a regex reference. I can't even remember it now. It's not uppercase like the rest of them, either, because it's really one of the packages implemented in the perl source code (in regcomp.c if you care to see it).
If you have to do that a lot, you can make that prototype value a constant using your favorite constant creator:
use constant REGEX_TYPE => ref qr//;
I talk about this at length in Effective Perl Programming as "Item 59: Compare values to prototypes".
If you want to try it both ways, you can use a version check on perl:
if( $] < 5.010 ) { warn "upgrade now!\n"; ... do it my way ... }
else { ... use is_regexp ... }
As of perl 5.10.0 there's a direct, non-tricky way to do this:
use 5.010;
use re qw(is_regexp);
if (is_regexp($pattern)) {
say "It's a regex";
} else {
say "Not a regex";
}
is_regexp uses the same internal test that perl uses, which means that unlike ref, it won't be fooled if, for some strange reason, you decide to bless a regex object into a class other than Regexp (yes, that's possible).
In the future (or right now, if you can ship code with a 5.10.0 requirement) this should be considered the standard answer to the problem. Not only because it avoids a tricky edge-case, but also because it has the advantage of saying exactly what it means. Expressive code is a good thing.
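Applied to the original question, a backwards-compatible check might look like the sketch below. The helper name output_matches is hypothetical (it is a standalone function rather than the asker's method), and it assumes perl 5.10 or later so that re::is_regexp is available.

```perl
use strict;
use warnings;
use re qw(is_regexp);

# Hypothetical helper: $test may be a plain string or a compiled regex.
sub output_matches {
    my ($output, $test) = @_;

    # Plain strings are quoted so they match literally, as before.
    my $re = is_regexp($test) ? $test : qr/\Q$test\E/;
    return $output =~ $re ? 1 : 0;
}

print output_matches("a   b", qr/a\s+b/), "\n";   # 1: regex passed in
print output_matches("a.b",   "a.b"),     "\n";   # 1: literal string
print output_matches("axb",   "a.b"),     "\n";   # 0: the dot is not a wildcard

# A regex blessed into another class fools ref but not is_regexp.
my $odd = bless qr/x/, 'Some::Class';
print ref($odd), " ", (is_regexp($odd) ? "is a regex" : "not a regex"), "\n";
```

Existing callers that pass strings keep their literal-match behaviour, and only callers that deliberately pass qr// get pattern matching.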
See the ref built-in.

What's a good Perl regex to untaint an absolute path?

Well, I tried and failed so, here I am again.
I need to match my abs path pattern.
/public_html/mystuff/10000001/001/10/01.cnt
I am in taint mode etc..
#!/usr/bin/perl -Tw
use CGI::Carp qw(fatalsToBrowser);
use strict;
use warnings;
$ENV{PATH} = "bin:/usr/bin";
delete @ENV{qw(IFS CDPATH BASH_ENV ENV)};
I need to open the same file a couple times or more and taint forces me to untaint the file name every time. Although I may be doing something else wrong, I still need help constructing this pattern for future reference.
my $file = "$var[5]";
if ($file =~ /(\w{1}[\w-\/]*)/) {
$under = "/$1\.cnt";
} else {
ErroR();
}
You can see by my beginner attempt that I am close to clueless.
I had to add the forward slash and extension to $1 due to my poorly constructed, but working, regex.
So, I need help learning how to fix my expression so $1 represents /public_html/mystuff/10000001/001/10/01.cnt
Could someone hold my hand here and show me how to make:
$file =~ /(\w{1}[\w-\/]*)/ match my absolute path /public_html/mystuff/10000001/001/10/01.cnt ?
Thanks for any assistance.
Edit: Using $ in the pattern (as I did before) is not advisable here because it can match \n at the end of the filename. Use \z instead because it unambiguously matches the end of the string.
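A quick way to see the difference, using a made-up filename with a trailing newline (the variable names are only for this sketch):

```perl
use strict;
use warnings;

# Hypothetical filename that still carries a trailing newline.
my $name = "01.cnt\n";

# $ also matches just before a final newline, so this pattern accepts the string.
my $with_dollar = ($name =~ /^[0-9]{2}\.cnt$/)  ? 1 : 0;

# \z matches only at the very end of the string, so this pattern rejects it.
my $with_z      = ($name =~ /^[0-9]{2}\.cnt\z/) ? 1 : 0;

print "dollar: $with_dollar, z: $with_z\n";   # dollar: 1, z: 0
```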
Be as specific as possible in what you are matching:
my $fn = '/public_html/mystuff/10000001/001/10/01.cnt';
if ( $fn =~ m!
^(
/public_html
/mystuff
/[0-9]{8}
/[0-9]{3}
/[0-9]{2}
/[0-9]{2}\.cnt
)\z!x ) {
print $1, "\n";
}
Alternatively, you can reduce the vertical space taken by the code by putting what I assume to be a common prefix, '/public_html/mystuff', in a variable, combining the various components in a qr// construct (see perldoc perlop), and then using the conditional operator (?:):
#!/usr/bin/perl
use strict;
use warnings;
my $fn = '/public_html/mystuff/10000001/001/10/01.cnt';
my $prefix = '/public_html/mystuff';
my $re = qr!^($prefix/[0-9]{8}/[0-9]{3}/[0-9]{2}/[0-9]{2}\.cnt)\z!;
$fn = $fn =~ $re ? $1 : undef;
die "Filename did not match the requirements" unless defined $fn;
print $fn, "\n";
Also, I cannot reconcile using a relative path as you do in
$ENV{PATH} = "bin:/usr/bin";
with using taint mode. Did you mean
$ENV{PATH} = "/bin:/usr/bin";
You talk about untainting the file path every time. That's probably because you aren't compartmentalizing your program steps.
In general, I break up these sorts of programs into stages. One of the earlier stages is data validation. Before I let the program continue, I validate all the data that I can. If any of it doesn't fit what I expect, I don't let the program continue. I don't want to get half-way through something important (like inserting stuff into a database) only to discover something is wrong.
So, when you get the data, untaint all of it and store the values in a new data structure. Don't use the original data or the CGI functions after that. The CGI module is just there to hand data to your program. After that, the rest of the program should know as little about CGI as possible.
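One way such a validation stage could look is sketched below. The names %raw, %clean and validate_input are hypothetical, the hard-coded input stands in for whatever the CGI layer hands over, and the pattern is the one from the answer above. Run under -T, the captured $1 is what untaints the value.

```perl
use strict;
use warnings;

# Hypothetical raw input, standing in for values received from CGI.
my %raw = ( file => '/public_html/mystuff/10000001/001/10/01.cnt' );

# As specific as possible, anchored with \z (pattern from the answer above).
my $path_re = qr!^(/public_html/mystuff/[0-9]{8}/[0-9]{3}/[0-9]{2}/[0-9]{2}\.cnt)\z!;

# Validate everything once, up front; die before any real work starts.
sub validate_input {
    my (%input) = @_;
    my %clean;

    if (defined $input{file} && $input{file} =~ $path_re) {
        $clean{file} = $1;   # the captured value is the untainted copy
    }
    else {
        die "Filename did not match the requirements\n";
    }
    return %clean;
}

my %clean = validate_input(%raw);

# From here on, the program uses only %clean, never %raw.
print $clean{file}, "\n";
```

Because the filename is untainted once and stored, later stages can open it as often as needed without repeating the regex.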
I don't know what you are doing, but it's almost always a design smell to take actual filenames as input.