What does this line in DBI.pm do? - regex

603 $dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i
or '' =~ /()/; # ensure $1 etc are empty if match fails
I don't understand what $dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i is for, and I have even more doubts about '' =~ /()/, which seems useless to me.

The first part is extracting two parts of the dsn string in the form:
dbi: first match ( optional second match ) :
These matches will be placed into $1 and $2 for the use in later code. The second part will only run if the match was unsuccessful. This is achieved by using or which will short-circuit (i.e. not execute) the second expression if the first one was successful.
As the comment says quite succinctly, it ensures that $1, $2, etc. are empty. Presumably so later code can check them and produce an appropriate error if they were not set (i.e. could not be extracted from the dsn string).
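A minimal sketch of the behavior described above, wrapped in a helper so it can be run in isolation. The name split_dsn and the sample DSN strings are made up for illustration; they are not part of DBI.

```perl
use strict;
use warnings;

# split_dsn is a hypothetical helper for illustration; it is not a DBI function.
sub split_dsn {
    my ($dsn) = @_;
    $dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i
        or '' =~ /()/;                # on failure, force $1 to the empty string
    my ($driver, $attrs) = ($1, $2);  # copy the captures before they can change
    return ($driver, $attrs, $dsn);
}

my ($driver, $attrs, $rest) = split_dsn('dbi:mysql(RaiseError=>1):database=test');
print "driver=$driver attrs=$attrs rest=$rest\n";
# prints: driver=mysql attrs=RaiseError=>1 rest=database=test

my ($bad_driver, $bad_attrs, $bad_rest) = split_dsn('not a dsn');
print "driver='$bad_driver' rest='$bad_rest'\n";
# prints: driver='' rest='not a dsn'
```

In the failing case the string is left unchanged and the driver comes back as a defined, empty string, which is what later code can test for.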

Equals-tilde, or =~, is the match operator.
Try the following code -- put it in a file, make executable with chmod +x, and run it:
#!/usr/bin/perl
$mystring = "Perl rocks.";
if ($mystring =~ /rocks/) {
print("Matches");
} else {
print("No match");
}
It will output Matches.
As for your example, it checks whether the connection string is in the expected format, extracts the driver name into $1, and strips the dbi:driver: prefix from $dsn:
$dsn = "dbi:SQLPlatform:database_name:host_name:port";
$dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i
or '' =~ /()/; # ensure $1 etc are empty if match fails
print($dsn);
Outputs database_name:host_name:port.

It's clear from the comments in the code:
602 # extract dbi:driver prefix from $dsn into $1
603 $dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i
604 or '' =~ /()/; # ensure $1 etc are empty if match fails
If you have problems understanding how s// and m// work see perlop and perlre.

If a capturing match fails $1 may still contain a value; the value of the last successful matching capture in the same dynamic scope, possibly from some other previous regexp. It appears the author didn't want a failed match at this point to leave some value in $1 from a previous regexp. To prevent this, he forced a "will always succeed" capturing match with nothing specified within the capturing parens. That means that there will be a match, and a capture of the empty string. In other words, $1 will now be empty rather than containing the match value from some previous successful match.
A more common idiom is simply to test for match success before executing whatever code will rely on $1's value, as in:
if( /(match)/ ) {
say $1;
}
While that's often the simplest approach, unfortunately code sometimes is not simple, and forcing that test into some complex code may make a tricky section even harder to deal with. That being the case, it may just be easier to ensure that $1 contains nothing after a failed match, rather than what it contained before the failed match.
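The stale-capture behavior described above can be seen in a few lines. This is only a sketch; the sample strings are invented for the demonstration.

```perl
use strict;
use warnings;

my @seen;

'driver=mysql' =~ /driver=(\w+)/;            # succeeds: $1 is "mysql"
'no digits here' =~ /(\d+)/;                 # fails: $1 is left untouched
push @seen, $1;                              # still "mysql"

'no digits here' =~ /(\d+)/ or '' =~ /()/;   # fails, then the fallback match succeeds
push @seen, $1;                              # now the empty string

print "[$_]\n" for @seen;
# prints: [mysql] and then []
```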
I actually think that's a good question. Finding documentation of the behavior of $1 after a failed match isn't easy within the Perl POD. I believe a more thorough explanation is found either in the camel book or the llama book. But I don't have them at my fingertips right now to check.

What is left out of the answers so far is the reason for that mysterious or '' =~ /()/. Without that bit of trickiness, $1 will be undefined if the match fails. The code is probably using $1 in a concatenation or a string shortly after this match. Doing this with $1 undefined will result in a "Use of uninitialized value $1 in concatenation (.) or string" warning if use warnings is in effect. With that or '' =~ /()/ trickiness in play, $1 will be defined (but empty) should the regular expression fail to match. This keeps that code that uses $1 from spewing.
The comment # ensure $1 etc are empty if match fails is incorrect. Get rid of that 'etc' and the comment is correct. This action sets $1, and $1 only. This code does not set $2. $2 will be undefined if the regular expression does not match.
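A short check of that claim, as a sketch with made-up input strings: after the fallback match, $1 is defined and empty, while $2 is undefined because the last successful pattern had only one group.

```perl
use strict;
use warnings;

'a-b' =~ /(\w)-(\w)/;                   # succeeds: $1 = "a", $2 = "b"
'zzz' =~ /(\d)(\d)/ or '' =~ /()/;      # fails, so the one-group fallback runs

my $one = defined $1 ? "defined ('$1')" : 'undef';
my $two = defined $2 ? "defined ('$2')" : 'undef';
print "\$1 is $one, \$2 is $two\n";
# prints: $1 is defined (''), $2 is undef
```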

Related

print the matched word in perl regex

I need to print all my matched strings from a stored line in perl. I have seen various posts on this
Print the matched string using perl
Perl Regex - Print the matched value
and I experimented to first try to print the first word. But I get a build error
Use of uninitialized value $1 in concatenation (.) or string at rg.pl line 10.
I have tried with split and arrays and it works, but while printing $1, it throws error.
My code is here
#!/usr/bin/perl/
use warnings;
use strict;
#my $line = "At a far distance near the bar, was a parked car. Star were shining in the night. The boy in the car had scar and he was at war with his enemy. \n";
my $line = "At a far distance near the bar, was a parked car. \n";
if($line =~ /[a-z]ar/gi)
{ print "$1 \n"; }
$_ = $line;
I want my output for this code to be
far
and subsequently print all the words containing ar,
far
near
bar
parked
car
I even tried changing my code, as below but that didnt work, same error
if($line =~ /[a-z]ar/gi) {
my $match = $1;
print "$match \n"; }
First, you didn't capture anything, which is how the $N variables are populated. Put parentheses around what you want to be captured into $1
if ($line =~ /([a-z]ar)/i) { print "$1\n" }
I've removed the /g which is unneeded (and with potential for trouble†) here.
Next, your pattern requires and captures one letter followed by literal ar, no more no less. That won't capture near, nor will it capture parked (it'll get par only). It will not even match a word that starts with ar, since it requires that there is a letter before ar. You need to use quantifiers, to tell it how many times to match a letter. And you also want to find all matches.
One way is to scoop them all up by providing the list context and /g (global) modifier
my @words = $line =~ /([a-z]*ar[a-z]*)/gi;
print "$_\n" for @words;
The [a-z]* means to match a letter, zero-or-more times. So an optional string of letters. We also added an optional string of letters after ar. The /g makes it continue through the string after a match, to find all such patterns. In the list context the list of matches is returned.
Or, you can match in scalar context like in the first example, but in a while loop
while ($line =~ /([a-z]*ar[a-z]*)/gi) { print "$1\n" }
Here /g does something different. It matches a pattern once and returns true, the while condition is true and we print. Then it comes back and looks for a match from where it matched previously ... and keeps doing this until there are no more matches.
This is complex behavior altogether. From Regexp Quote-Like Operators in perlop
The /g modifier specifies global pattern matching--that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.
In scalar context, each execution of m//g finds the next match, returning true if it matches, and false if there is no further match. [...]
Read about this in more detail and in a tutorial manner in perlretut, under "Global matching."
† Note on using /g modifier in scalar context
I've used that above, in while (/.../g), which is a very common way to hop over all occurrences of the pattern in a string, each time giving us control in the while body.
While this use is intended and idiomatic, the use of /g in scalar context can bring subtle trouble when not in the loop condition: the next regex with /g on this variable will continue from the previous match, not from the string's beginning, which may be unexpected.
That "next regex" may also simply be that same expression -- in the next pass of some larger loop in which our expression happens to be, and this holds across function calls as well. Consider
use warnings;
use strict;
use feature 'say';
my $s = q(one two three);
sub func { say $1 if $_[0] =~ /(\w+)/g }; # /g may be of great consequence!
for (1..4) {
# ... perhaps much, much later ...
func($s);
}
This loop prints lines one, then two, then three, and that's that: the fourth call finds no further match, so it prints nothing. This (working) example is so bare-bones that it is artificial, but I hope that it conveys that /g in scalar context may surprise.
For one thing, it is not uncommon to see /g on a regex in an if condition being plain wrong.
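The mechanism behind this is the per-string match position, which pos() exposes. A small sketch, reusing the string from the example above, of how a scalar-context /g match remembers where it stopped and how assigning to pos() resets it:

```perl
use strict;
use warnings;

my $s = q(one two three);

$s =~ /(\w+)/g;            # scalar context: matches "one" and remembers where it stopped
my $first  = $1;
my $offset = pos($s);      # 3, the offset just past "one"

$s =~ /(\w+)/g;            # resumes from offset 3, so this finds "two"
my $second = $1;

pos($s) = undef;           # forget the position; the next /g match starts at 0 again
$s =~ /(\w+)/g;
my $restarted = $1;

print "$first $offset $second $restarted\n";
# prints: one 3 two one
```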
For multiple matches, use a while loop. Also, I surrounded the quantity you want to capture with parentheses to indicate that it is a capture group.
while ($line =~ /([a-z]*ar[a-z]*)/gi ) {
print "$1 \n";
}

Perl: how to use string variables as search pattern and replacement in regex

I want to use string variables for both search pattern and replacement in regex. The expected output is like this,
$ perl -e '$a="abcdeabCde"; $a=~s/b(.)d/_$1$1_/g; print "$a\n"'
a_cc_ea_CC_e
But when I moved the pattern and replacement to a variable, $1 was not evaluated.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/g; print "$a\n"'
a_$1$1_ea_$1$1_e
When I use "ee" modifier, it gives errors.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/gee; print "$a\n"'
Scalar found where operator expected at (eval 1) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 1) line 1, near "$1_"
(Missing operator before _?)
Scalar found where operator expected at (eval 2) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 2) line 1, near "$1_"
(Missing operator before _?)
aeae
What do I miss here?
Edit
Both $p and $r are written by myself. What I need is to do multiple similar regex replacing without touching the perl code, so $p and $r have to be in a separate data file. I hope this file can be used with C++/python code later.
Here are some examples of $p and $r.
^(.*\D)?((19|18|20)\d\d)年 $1$2<digits>年
^(.*\D)?(0\d)年 $1$2<digits>年
([TKZGD])(\d+)/(\d+)([^\d/]) $1$2<digits>$3<digits>$4
([^/TKZGD\d])(\d+)/(\d+)([^/\d]) $1$3分之$2$4
With $p="b(.)d"; you are getting a string with literal characters b(.)d. In general, regex patterns are not preserved in quoted strings and may not have their expected meaning in a regex. However, see Note at the end.
This is what qr operator is for: $p = qr/b(.)d/; forms the string as a regular expression.
As for the replacement part and /ee, the problem is that $r is first evaluated, to yield _$1$1_, which is then evaluated as code. Alas, that is not valid Perl code. The _ are barewords and even $1$1 itself isn't valid (for example, $1 . $1 would be).
The provided examples of $r have $Ns mixed with text in various ways. One way to parse this is to extract all $N and all else into a list that maintains their order from the string. Then, that can be processed into a string that will be valid code. For example, we need
'$1_$2$3other' --> $1 . '_' . $2 . $3 . 'other'
which is valid Perl code that can be evaluated.
The part of breaking this up is helped by split's capturing in the separator pattern.
sub repl {
my ($r) = @_;
my @terms = grep { $_ } split /(\$\d)/, $r;
return join '.', map { /^\$/ ? $_ : q(') . $_ . q(') } @terms;
}
$var =~ s/$p/repl($r)/gee;
With capturing /(...)/ in split's pattern, the separators are returned as a part of the list. Thus this extracts from $r an array of terms which are either $N or other, in their original order and with everything (other than trailing whitespace) kept. This includes possible (leading) empty strings so those need be filtered out.
Then every term other than $Ns is wrapped in '', so when they are all joined by . we get a valid Perl expression, as in the example above.
Then /ee will have this function return the string (such as above), and evaluate it as valid code.
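Put together with the input from the question, the approach might look like the sketch below. One deliberate difference from the snippet above: it filters with grep { length } rather than grep { $_ }, an assumption on my part so that a literal "0" in the template is not discarded. Literal text containing a single quote would still need escaping, which this sketch does not attempt.

```perl
use strict;
use warnings;

# Turn a replacement template such as '_$1$1_' into a Perl expression
# such as '_'.$1.$1.'_' that the second /e can evaluate.
sub repl {
    my ($r) = @_;
    my @terms = grep { length } split /(\$\d)/, $r;   # drop empty fields only
    return join '.', map { /^\$/ ? $_ : q(') . $_ . q(') } @terms;
}

my $p = qr/b(.)d/;     # pattern, as it might be read from the data file
my $r = '_$1$1_';      # replacement template, kept as plain text

(my $out = 'abcdeabCde') =~ s/$p/repl($r)/gee;
print "$out\n";
# prints: a_cc_ea_CC_e
```

The first /e calls repl($r) to build the expression, and the second /e evaluates that expression while $1 still refers to the current match.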
We are told that safety of using /ee on external input is not a concern here. Still, this is something to keep in mind. See this post, provided by Håkon Hægland in a comment. Along with the discussion it also directs us to String::Substitution. Its use is demonstrated in this post. Another way to approach this is with replace from Data::Munge
For more discussion of /ee see this post, with several useful answers.
Note on using "b(.)d" for a regex pattern
In this case, with parens and dot, their special meaning is maintained. Thanks to kangshiyin for an early mention of this, and to Håkon Hægland for asserting it. However, this is a special case. Double-quoted strings mangle many patterns since interpolation and escape processing are done -- for example, "\w" is just an escaped w (which is an unrecognized escape, so the backslash is dropped). Single quotes should work, as there is no interpolation. Still, strings intended for use as regex patterns are best formed using qr, as we are getting a true regex. Then all modifiers may be used as well.
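A quick comparison of the three ways of writing a pattern, as a sketch. The pattern \w+\d and the test strings are arbitrary, and the no warnings 'misc' line is only there to silence the "Unrecognized escape" warning that the double-quoted form would otherwise raise.

```perl
use strict;
use warnings;

my $double = do { no warnings 'misc'; "\w+\d" };   # backslashes are dropped: the string is "w+d"
my $single = '\w+\d';                               # kept verbatim
my $regex  = qr/\w+\d/i;                            # compiled, and carries its own modifier

my $double_ok = 'abc1' =~ /^$double$/ ? 'matches' : 'no match';
my $single_ok = 'abc1' =~ /^$single$/ ? 'matches' : 'no match';
my $regex_ok  = 'ABC1' =~ /^$regex$/  ? 'matches' : 'no match';

print "double: $double_ok\nsingle: $single_ok\nqr: $regex_ok\n";
# prints: double: no match / single: matches / qr: matches (one per line)
```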

Perl match only returning "1". Booleans? Why?

This has got to be obvious but I'm just not seeing it.
I have documents containing thousands of records just like the one below:
Row:1 DATA:
[0]37755442
[1]DDG00000010
[2]FALLS
[3]IMAGE
[4]Defect
[5]3
[6]CLOSED
I've managed to get each record separated and I'm now trying to parse out each field.
I'm trying to match the numbered headers so that I can pull out the data that follows them, but the problem is that my matches are only returning "1" when they succeed and nothing when they don't. This is happening for any match I try to apply.
For instance, applied to a simple word within each record:
my($foo) = $record=~ /Defect/;
print STDOUT $foo;
prints out a "1" for each record if it contains "Defect" and nothing if it contains something else.
Alternatively:
$record =~ /Defect/;
print STDOUT $1;
prints absolutely nothing.
$record =~ s/Defect/Blefect/
will replace "Defect" with "Blefect" perfectly fine on the other hand.
I'm really confused as to why the returns on my matches are so screwy.
Any help would be much appreciated.
You need to use capturing parentheses to actually capture:
if ($record =~ /(Defect)/ ) {
print "$1\n";
}
I think what you really want is to wrap the regex in parentheses:
my($foo) = $record=~ /(Defect)/;
In list context, the groups are returned, not the match itself. And your original code has no groups.
The =~ perl operator takes a string (left operand) and a regular expression (right operand) and matches the string against the RE, returning a boolean value (true or false) depending on whether the re matches.
Now perl doesn't really have a boolean type -- instead every value (of any type) is treated as either 'true' or 'false' when in a boolean context -- most things are 'true', but the empty string and the special 'undef' value for undefined things are false. So when returning a boolean, it generally uses '1' for true and '' (empty string) for false.
Now as to your last question, where trying to print $1 prints nothing. Whenever you match a regular expression, perl sets $1, $2 ... to the values of parenthesized subexpressions within the RE. In your example however, there are NO parenthesized subexpressions, so $1 is always empty. If you change it to
$record =~ /(Defect)/;
print STDOUT $1;
You'll get something more like what you expect (Defect if it matches and nothing if it doesn't).
The most common idiom for regexp matching I generally see is something like:
if ($string =~ /regexp with () subexpressions/) {
... code that uses $1 etc for the subexpressions matched
} else {
... code for when the expression doesn't match at all
}
From perlop, Quote and Quote-Like operators [bits in brackets added by me]:
/PATTERN/msixpodualgc
Searches a string for a pattern match, and in scalar context returns true [1] if it succeeds, false [the empty string] if it fails.
(Looking at the section on s/// will also be useful ;-)
Perl just doesn't have a discrete boolean type or true/false aliases, so 1 and the empty string are what the match operator returns: however, they could very well be other values without making the documentation incorrect.
$1 will never be defined because there is no capture group: perhaps $& (aka $MATCH) is desired? (Or better, change the regular expression to have a capture group ;-)
Happy coding.
my($foo) = $record=~ /Defect/;
print STDOUT $foo;
Rather than this you should do
$record =~ /Defect/;
my $foo = $&; # Matched portion of the $record.
As your goal seems to be to get the matched portion.
The return value is true/false indicating if match was successful or not.
You may find http://perldoc.perl.org/perlreref.html handy.
If you want the result of a match as "true" or "false", evaluate the pattern match in scalar context. Your first example is close to that, but my($foo) = ... is a list assignment, so the match runs in list context; because the pattern has no capture groups, a successful match returns the list (1), and $foo gets that 1 (or undef when the match fails).
But if you want to capture the text that matched a part of your pattern, use grouping parentheses and then check the corresponding $ variable. For example, consider the expression:
$record =~ /(.*)ing/
A match on the word "speaking" will assign "speak" to $1, "listening" will assign "listen" to $1, etc. That's what you are trying to do in your second example. The trouble is that you need to add in the grouping parentheses. "$record =~ /Defect/" will assign nothing to $1 because there are no grouping parentheses in the pattern.
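The different return values discussed in this thread can be put side by side. This is a sketch only: the record below is a shortened version of the one in the question, and the %field hash is one possible way to collect the numbered fields, not something the question asked for explicitly.

```perl
use strict;
use warnings;

my $record = "Row:1 DATA:\n[0]37755442\n[4]Defect\n[6]CLOSED\n";

my $ok       = $record =~ /Defect/;          # scalar context: 1 (true)
my $failed   = $record =~ /Blefect/;         # scalar context: "" (false, but defined)
my ($flag)   = $record =~ /Defect/;          # list context, no groups: the list (1)
my ($word)   = $record =~ /(Defect)/;        # list context, one group: "Defect"
my ($status) = $record =~ /^\[6\](.*)$/m;    # text after the [6] header: "CLOSED"

# With /g in list context, every (index, value) pair is returned at once.
my %field = $record =~ /^\[(\d+)\](.*)$/mg;

print "ok=$ok failed='$failed' flag=$flag word=$word status=$status\n";
print "field $_ = $field{$_}\n" for sort { $a <=> $b } keys %field;
```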

How to have a variable as regex in Perl

I think this question is repeated, but searching wasn't helpful for me.
my $pattern = "javascript:window.open\('([^']+)'\);";
$mech->content =~ m/($pattern)/;
print $1;
I want to have an external $pattern in the regular expression. How can I do this? The current one returns:
Use of uninitialized value $1 in print at main.pm line 20.
$1 was empty, so the match did not succeed. I'll make up a constant string in my example of which I know that it will match the pattern.
Declare your regular expression with qr, not as a simple string. Also, you're capturing twice, once in $pattern for the open call's parentheses, once in the m operator for the whole thing, therefore you get two results. Instead of $1, $2 etc. I prefer to assign the results to an array.
my $pattern = qr"javascript:window.open\('([^']+)'\);";
my $content = "javascript:window.open('something');";
my @results = $content =~ m/($pattern)/;
# the expression returns the list
# (
# q{javascript:window.open('something');},
# 'something'
# )
When I compile that string into a regex, like so:
my $pattern = "javascript:window.open\('([^']+)'\);";
my $regex = qr/$pattern/;
I get just what I think I should get, following regex:
(?-xism:javascript:window.open('([^']+)');)
Notice that it is looking for a capture group and not an open paren at the end of 'open'. And in that capture group, the first thing it expects is a single quote. So it will match
javascript:window.open'fum';
but not
javascript:window.open('fum');
One thing you have to learn, is that in Perl, "\(" is the same thing as "(" you're just telling Perl that you want a literal '(' in the string. In order to get lasting escapes, you need to double them.
my $pattern = "javascript:window.open\\('([^']+)'\\);";
my $regex = qr/$pattern/;
Actually preserves the literal ( and yields:
(?-xism:javascript:window.open\('([^']+)'\);)
Which is what I think you want.
As for your question, you should always test the results of a match before using it.
if ( $mech->content =~ m/($pattern)/ ) {
print $1;
}
makes much more sense. And if you want to see it regardless, then it's already implicit in that idea that it might not have a value. i.e., you might not have matched anything. In that case it's best to put alternatives
$mech->content =~ m/($pattern)/;
print $1 || 'UNDEF!';
However, I prefer to grab my captures in the same statement, like so:
my ( $open_arg ) = $mech->content =~ m/($pattern)/;
print $open_arg || 'UNDEF!';
The parens around $open_arg puts the match into a "list context" and returns the captures in a list. Here I'm only expecting one value, so that's all I'm providing for.
Finally, one of the root causes of your problems is that you do not need to specify your expression in a string in order for your regex to be "portable". You can get perl to pre-compile your expression. That way, you only care what instructions the characters are to a regex and not whether or not you'll save your escapes until it is compiled into an expression.
A compiled regex will interpolate itself into other regexes properly. Thus, you get a portable expression that interpolates just as well as a string--and specifically correctly handles instructions that could be lost in a string.
my $pattern = qr/javascript:window.open\('([^']+)'\);/;
Is all that you need. Then you can use it, just as you did. Although, putting parens around the whole thing, would return the whole matched expression (and not just what's between the quotes).
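A runnable sketch of that last point, showing the difference between using the compiled pattern directly and wrapping it in an extra pair of parentheses. The HTML fragment and the file name popup.html are invented for the example, and the dot in window.open is escaped here, which the original pattern does not do.

```perl
use strict;
use warnings;

my $pattern = qr/javascript:window\.open\('([^']+)'\);/;
my $content = q{<a href="javascript:window.open('popup.html');">open</a>};

my ($arg) = $content =~ $pattern;                 # just the inner capture
my ($whole, $inner) = $content =~ /($pattern)/;   # extra parens add an outer group

print "arg=$arg\nwhole=$whole\ninner=$inner\n";
```

With the extra parentheses the whole matched text arrives first and the text between the quotes second, which is why the capture numbering shifts by one.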
You do not need the parentheses in the match pattern. With them, the whole pattern is captured and returned as $1; I am guessing the pattern is simply not matching, but that is only a guess.
$mech->content =~ m/$pattern/;
or
$mech->content =~ m/(?:$pattern)/;
These are the clustering, non-capturing parentheses.
The way you are doing it is correct.
The solutions have been already given, I'd like to point out that the window.open call might have multiple parameters included in "" and grouped by comma like:
javascript:window.open("http://www.javascript-coder.com","mywindow","status=1,toolbar=1");
There might be spaces between the function name and parentheses, so I'd use a slightly different regex for that:
my $pattern = qr{
javascript:window.open\s*
\(
([^)]+)
\)
}x;
print $1 if $text =~ /$pattern/;
Now you have all parameters in $1 and can process them afterwards with split /,/, $stuff and so on.
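One caveat with a plain split /,/ is that a quoted argument may itself contain a comma, as the third argument in the sample above does. The sketch below contrasts the two; pulling out the double-quoted strings with a second /g match is my own substitution for the split, and it assumes the arguments never contain escaped quotes or a closing parenthesis.

```perl
use strict;
use warnings;

my $pattern = qr{
    javascript:window\.open\s*
    \(
    ([^)]+)
    \)
}x;

my $text = q{javascript:window.open("http://www.javascript-coder.com","mywindow","status=1,toolbar=1");};

my (@naive, @quoted);
if ($text =~ $pattern) {
    my $args = $1;
    @naive  = split /,/, $args;          # 4 pieces: the last argument is cut in two
    @quoted = $args =~ /"([^"]*)"/g;     # 3 pieces: one per quoted argument
}

print scalar(@naive), " vs ", scalar(@quoted), "\n";
# prints: 4 vs 3
```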
It reports an uninitialized value because $1 is undefined, and $1 is undefined because nothing matched your pattern. Note also that wrapping a second set of parentheses around the pattern creates a nested capture group: on a successful match $1 would hold the whole match and $2 the text between the quotes.

What is the regular Expression to uncomment a block of Perl code in Eclipse?

I need a regular expression to uncomment a block of Perl code, commented with # in each line.
As of now, my find expression in the Eclipse IDE is (^#(.*$\R)+) which matches the commented block, but if I give $2 as the replace expression, it only prints the last matched line. How do I remove the # while replacing?
For example, I need to convert:
# print "yes";
# print "no";
# print "blah";
to
print "yes";
print "no";
print "blah";
In most flavors, when a capturing group is repeated, only the last capture is kept. Your original pattern uses + repetition to match multiple lines of comments, but group 2 can only keep what was captured in the last match from the last line. This is the source of your problem.
To fix this, you can remove the outer repetition, so you match and replace one line at a time. Perhaps the simplest pattern to do this is to match:
^#\s*
And replace with the empty string.
Since this performs match and replacement one line at a time, you must repeat it as many times as necessary (in some flavors, you can use the g global flag, in e.g. Java there are replaceFirst/All pair of methods instead).
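The same effect can be reproduced outside Eclipse. The sketch below uses Perl rather than the Eclipse find/replace dialog, and it uses [ \t]* in place of \s* so that the match cannot run across a line break; both choices are mine, not part of the answer above.

```perl
use strict;
use warnings;

my $block = qq{# print "yes";\n# print "no";\n# print "blah";\n};

# A repeated capturing group keeps only its final iteration.
my ($last) = $block =~ /^(?:#(.*)\n)+/;
print "captured: [$last]\n";
# prints: captured: [ print "blah";]

# Removing the marker one line at a time, with /m so ^ matches at each line start.
(my $plain = $block) =~ s/^#[ \t]*//mg;
print $plain;
```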
References
regular-expressions.info/Repeating a Captured Group vs Capturing a Repeated Group
Related questions
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps all repeated matches
Special note on Eclipse keyboard shortcuts
In Java mode, Eclipse already has keyboard shortcuts to add/remove/toggle block comments. By default, Ctrl+/ binds to the "Toggle comment" action. You can highlight multiple lines, and hit Ctrl+/ to toggle block comments (i.e. //) on and off.
You can hit Ctrl+Shift+L to see a list of all keyboard shortcuts. There may be one in Perl mode to toggle Perl block comments #.
Related questions
What is your favorite hot-key in Eclipse?
Hidden features of Eclipse
Search with ^#(.*$) and replace with $1
You can try this one: -
use strict;
use warnings;
my $data = "#Hello\n#stack\n#overflow\n";
$data =~ s/^#//mg; # /m lets ^ match at the start of every line
print $data;
OUTPUT:-
Hello
stack
overflow
Or
open(IN, '<', "test.pl") or die $!;
read(IN, my $data, -s "test.pl"); #reading a file
$data =~ s/^#//mg;
open(OUT, '>', "test1.pl") or die $!;
print OUT $data; #Writing a file
close OUT;
close IN;
Note: Take care of #!/usr/bin/perl in the Perl script, it will uncomment it also.
You need the GLOBAL g switch.
s/^#(.+)/$1/g
In order to determine whether a perl '#' is a comment or something else, you have to compile the perl and build a parse tree, because of Schwartz's Snippet
whatever / 25 ; # / ; die "this dies!";
Whether that '#' is a comment or part of a regex depends on whether whatever() is nullary, which depends on the parse tree.
For the simple cases, however, yours is failing because (^#(.*$\R)+) repeats a capturing group, which is not what you wanted.
But anyway, if you want to handle simple cases, I don't even like the regex that everyone else is using, because it fails if there is whitespace before the # on the line. What about
^\s*#(.*)$
? This will match any line that begins with a comment (optionally with whitespace, e.g., for indented blocks).
Try this regex:
(^[\t ]+)(\#)(.*)
With this replacement:
$1$3
Group 1 is (^[\t ]+) and matches all leading whitespace (spaces and tabs).
Group 2 is (#) and matches one # character.
Group 3 is (.*) and matches the rest of the line.
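Note that [\t ]+ requires at least one whitespace character, so a # in the first column would not match. A variant using * instead, sketched here in Perl rather than in the Eclipse dialog, keeps the indentation and also handles unindented lines. The (?!!) lookahead that skips a #! shebang line is my own addition, and the sample code is invented.

```perl
use strict;
use warnings;

my $code = "#!/usr/bin/perl\n# print 1;\nif (1) {\n    # print 2;\n}\n";

# ([ \t]*) keeps the indentation; (?!!) leaves a "#!" shebang line alone.
(my $uncommented = $code) =~ s/^([ \t]*)#(?!!)/$1/mg;
print $uncommented;
```

Only the # itself is removed, so the space that followed it stays in place. As noted earlier in the thread, none of these patterns can tell a real comment from a # inside a string or regex.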