Perl pattern match variable question - regex

I'm trying to open a file, match a particular line, and then wrap HTML tags around that line. Seems terribly simple but apparently I'm missing something and don't understand the Perl matched pattern variables correctly.
I'm matching the line with this:
$line =~ m/(Number of items:.*)/i;
Which puts the entire line into $1. I try to then print out my new line like this:
print "<p>" . $1 . "<\/p>";
I expect it to print this:
<p>Number of items: 22</p>
However, I'm actually getting this:
</p>umber of items: 22
I've tried all kinds of variations - printing each bit on a separate line, setting $1 to a new variable, using $+ and $&, etc. and I always get the same result.
What am I missing?

You have an \r in your match, which when printed results in the malformed output.
edit:
To explain further, chances are your file has Windows-style \r\n line endings. chomp removes only the trailing \n, so the \r is left behind and gets slurped into your greedy match, which produces the unpleasant output (\r means go back to the start of the line and continue printing).
You can remove the \r by adding something like
$line =~ tr/\015//d;
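To see the effect in isolation, here is a minimal sketch. The input string is a made-up stand-in for a line read from a CRLF file, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical line as read from a file with Windows-style line endings.
my $line = "Number of items: 22\r\n";
chomp $line;    # with the default $/ this strips "\n" only; "\r" remains

# . matches any character except "\n", so the greedy .* swallows the "\r".
my ($with_cr) = $line =~ /(Number of items:.*)/i;

# Delete every carriage return (octal 015), then match again.
(my $clean = $line) =~ tr/\015//d;
my ($without_cr) = $clean =~ /(Number of items:.*)/i;

print "<p>$without_cr</p>\n";
```

Printing $with_cr instead would send the cursor back to column 0 before the closing tag is written, which is what overwrote the start of the line in the question.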

Can you provide a complete code snippet that demonstrates your problem? I'm not seeing it.
One thing to be cautious of is that $1 and friends refer to captures from the last successful match in that dynamic scope. You should always verify that a match succeeds before using one:
$line = "Foo Number of items: 97\n";
if ( $line =~ m/(Number of items:.*)/i ) {
print "<p>" . $1 . "<\/p>\n";
}

You've just learned (for future reference) how dangerous .* can be.
Having banged my head against similar unpleasantnesses, these days I like to be as precise as I can about what I expect to capture. Maybe
$line =~ m/(Number of items:\s+\d+)/;
Then I'm sure of not capturing the offending control character in the first place. Whatever Cygwin may be doing with Windows files, I can remain blissfully ignorant.
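A small sketch of that idea, assuming the count is always a run of digits. The input line is invented for illustration and is deliberately left with its \r\n ending.

```perl
use strict;
use warnings;

# Hypothetical CRLF-terminated input; no chomp or tr/// cleanup is done.
my $line = "Number of items: 22\r\n";

my $captured;
if ( $line =~ /(Number of items:\s+\d+)/ ) {
    $captured = $1;    # the capture stops after the last digit, before "\r"
    print "<p>$captured</p>\n";
}
```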

Related

Make a regular expression in perl to grep value work on a string with different endings

I have this code in perl where I want to extract the value of 'EUR_AF', in this case '0.39'.
Sometimes 'EUR_AF' ends with ';', sometimes it doesn't.
Alternatively, 'EUR_AF' may end with '=0' instead of '=0.39;' or '=0.39'.
How do I make the code handle that? Can't seem to find it online...I could of course wrap everything in an almost endless if-elsif-else statement, but that seems overkill.
Example text:
AVGPOST=0.9092;AN=2184;RSQ=0.5988;ERATE=0.0081;AC=144;VT=SNP;THETA=0.0045;AA=A;SNPSOURCE=LOWCOV;LDAF=0.0959;AF=0.07;ASN_AF=0.05;AMR_AF=0.10;AFR_AF=0.11;EUR_AF=0.039
Code: $INFO =~ m/\;EUR\_AF\=(.*?)(;)/
I did find that: $INFO =~ m/\;EUR\_AF\=(.*?0)/ handles the cases of EUR_AF=0, but how to handle alternative scenarios efficiently?
Extract one value:
my ($eur_af) = $s =~ /(?:^|;)EUR_AF=([^;]*)/;
my ($eur_af) = ";$s" =~ /;EUR_AF=([^;]*)/;
Extract all values:
my %rec = split(/[=;]/, $s);
my $eur_af = $rec{EUR_AF};
This regex should work for you: (?<=EUR_AF=)\d+(\.\d+)?
It means
(?<=EUR_AF=) - look for a string preceded by EUR_AF=
\d+(\.\d+)? - consisting of one or more digits, optionally followed by a decimal point and more digits
EDIT: I originally wanted the whole regex to return the correct result, not only the capture group. If you want the correct capture group edit it to (?<=EUR_AF=)(\d+(?:\.\d+)?)
I have found the answer. The code:
$INFO =~ m/(?:^|;)EUR_AF=([^;]*)/
seems to handle the cases where EUR_AF=0 and EUR_AF=0.39, ending with or without ;. The captured value in $1 will be 0 or 0.39.
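As a quick check of that pattern, here is a sketch that wraps it in a helper. The sub name eur_af and the sample strings are made up for illustration; they are shortened versions of the INFO field in the question.

```perl
use strict;
use warnings;

# Return the EUR_AF value from an INFO string, or undef if the key is absent.
sub eur_af {
    my ($info) = @_;
    # (?:^|;) anchors the key at the start or after a separator;
    # [^;]* takes everything up to the next ";" or the end of the string.
    my ($value) = $info =~ /(?:^|;)EUR_AF=([^;]*)/;
    return $value;
}

# Made-up inputs covering the variations from the question.
for my $info ( 'AF=0.07;EUR_AF=0.39;VT=SNP', 'AF=0.07;EUR_AF=0.39', 'EUR_AF=0;AF=0.07' ) {
    print eur_af($info), "\n";
}
```

Using the match in list context returns the capture directly, so the result does not depend on whatever $1 held before.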

perl regex remove newlines in string

I have a Perl script which runs over a database dump in a plain text file, trying to remove all instances of newlines and possibly other odd characters when I see strings between quotes:
INSERT INTO ... VALUES ( "... these are the lines I'm interested in." )
I slurp in the file:
@file = <FILE>;
and:
foreach my $line (@file) {
$line =~ s/"[^"]*(\R)+[^"]*"//g;
# I want to get rid of newlines in strings
# And other odd characters I might come across
}
One character class I used instead of (\R) was:
([\r\n\t\v\f]+)
and I would try to:
$line =~ s/"[^"]+?([\r\n\t\v\f]+)[^"]*"//g;
I'm sure I'm missing something. I try to start matching with a literal double quote, scan past anything not a double quote (non-greedy, at least one match), reach the characters I want to get rid of, and keep scanning not double quote (any number of other characters not a double quote) until I reach the ending double quote.
So I wanted to replace $1 capture above with nothing.
I've tried on-line regex builders, and
/"[^"]*?([\r\n\t\f\v]+)[^"]*"/
worked with an on-line test, using a short paragraph with newlines and tabs in it, although it was in PHP pcre mode. I thought it would have worked with Perl.
Perhaps I'm not escaping some characters properly in the regex for Perl? Or the pattern is just not going to work the way I want it to, because it's wrong.
Thank you, any help appreciated.
The regex at regex101.com:
"[^"]*?([\r\n\f\t\v]+)[^"]*?"
matches for strings like this:
"This is
my\t test
string.
So there!"
I'm thoroughly puzzled now. :)
The real problem is that you will only find one group of \R's when there could be many groups between quotes. The best thing to do is make a callback (eval) with a general match between quotes, then substitute the \R's in the replacement.
Something like:
sub repl {
my ($content) = @_;
$content =~ s/\R+//g;
return $content;
}
$input =~ s/"([^"]*)"/ '"' . repl($1) . '"' /ge; # put the surrounding quotes back
edit: If you're looking for only 1 linebreak cluster, you have to exclude linebreaks leading up to it. For example: [^"\r\n]+
edit2: To slurp the file into $input, do a
$/ = undef;
my $input = <$fh>;
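Put together as a runnable sketch: the sub name strip_linebreaks and the sample INSERT statement are invented for illustration, and the approach assumes the quoted strings contain no escaped double quotes. This variant puts the surrounding quotes back in the replacement so the statement stays valid.

```perl
use strict;
use warnings;

# Remove every run of line breaks from one quoted string's content.
sub strip_linebreaks {
    my ($content) = @_;
    $content =~ s/\R+//g;    # \R needs Perl 5.10 or later
    return $content;
}

# Hypothetical dump fragment; assumes strings contain no escaped quotes.
my $input = qq{INSERT INTO t VALUES ( "first\nsecond\r\nthird" );\n};

# /e evaluates the replacement as code; the quotes are re-added explicitly.
(my $output = $input) =~ s/"([^"]*)"/ '"' . strip_linebreaks($1) . '"' /ge;

print $output;
```

Because the whole quoted string is matched first and cleaned inside the callback, any number of line-break clusters within one string is handled in a single pass.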

QRegex look ahead/look behind

I have been pondering this for quite a while and still can't figure it out: regex lookahead and lookbehind. I'm not sure which to use in my situation, and I am still having trouble grasping the concept. Let me give you an example.
I have a name....My Business,LLC (Milwaukee,WI)_12345678_12345678
What I want to do is if there is a comma in the name, no matter how many, remove it. At the same time, if there is not a comma in the name, still read the line. The one-liner I have is listed below.
s/(.*?)(_)(\d+_)(\d+$)/$1$2$3$4/gi;
I want to remove any comma from $1(My Business,LLC (Milwaukee,WI)). I could call out the comma in regex as a literal string((.?),(.?),(.*?)(_.*?$)) if it was this EXACT situation everytime, however it is not.
I want it to omit commas and match 'My Business, LLC_12345678_12345678' or just 'My Business_12345678_12345678', even though there is no comma.
In any situation I want it to match the line, comma or not, and remove any commas(if any) no matter how many or where.
If someone can help me understand this concept, it will be a breakthrough!!
Use Perl's /e modifier so that the replacement part of s/// is evaluated as code and can call your function:
$str = 'My Business,LLC (Milwaukee,WI)_12345678_12345678';
## modified your regex as well using lookahead
$str =~ s/(.*?)(?=_\d+_\d+$)/funct($1)/ge;
print $str;
sub funct{
my $val = shift;
## replacing , with empty, use anything what you want!
$val =~ s/,//g;
return $val;
}
With funct($1) in the replacement, you are calling funct() with $1 as its argument, and its return value becomes the replacement text.
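A variant of the same idea, sketched against the three sample names from the question. The sub name strip_name_commas is hypothetical, and the pattern is anchored with ^ so it runs once per line without /g.

```perl
use strict;
use warnings;

# Strip commas from the name part that precedes the _digits_digits suffix.
sub strip_name_commas {
    my ($line) = @_;
    # The lookahead checks for the suffix without consuming it,
    # so only the name in $1 is replaced.
    $line =~ s{^(.*?)(?=_\d+_\d+$)}{
        my $name = $1;
        $name =~ tr/,//d;
        $name;
    }e;
    return $line;
}

print strip_name_commas($_), "\n"
    for 'My Business,LLC (Milwaukee,WI)_12345678_12345678',
        'My Business, LLC_12345678_12345678',
        'My Business_12345678_12345678';
```

A line with no comma still matches and is returned unchanged, which covers the "comma or not" requirement.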

What does this line in DBI.pm do?

603 $dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i
or '' =~ /()/; # ensure $1 etc are empty if match fails
I don't understand what $dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i is for, and I have even more doubt about '' =~ /()/, which seems useless to me.
The first part is extracting two parts of the dsn string in the form:
dbi: first match ( optional second match ) :
These matches will be placed into $1 and $2 for the use in later code. The second part will only run if the match was unsuccessful. This is achieved by using or which will short-circuit (i.e. not execute) the second expression if the first one was successful.
As the comment says quite succinctly, it ensures that $1, $2, etc. are empty. Presumably so later code can check them and produce an appropriate error if they were not set (i.e. could not be extracted from the dsn string).
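To make the two captures concrete, here is a sketch run against a made-up DSN; the driver name and attribute text are illustrative, not taken from DBI.pm.

```perl
use strict;
use warnings;

# Hypothetical DSN in the dbi:Driver(attributes):rest form.
my $dsn = 'dbi:mysql(RaiseError=>1):database=test';

$dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i
    or '' =~ /()/;    # fallback match runs only if the substitution fails

# Copy the captures right away, before another match can overwrite them.
my ( $driver, $attrs ) = ( $1, $2 );

print "driver=$driver attrs=$attrs rest=$dsn\n";
```

The substitution also strips the matched prefix, so $dsn is left holding only the driver-specific remainder.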
Equals-tilde, or =~, is the match operator.
Try the following code -- put it in a file, make executable with chmod +x, and run it:
#!/usr/bin/perl
$mystring = "Perl rocks.";
if ($mystring =~ /rocks/) {
print("Matches");
} else {
print("No match");
}
It will output Matches.
As for your example, it checks if the connection string is in the correct format, and extracts the database name, etc:
print($dsn);
$dsn = "dbi:SQLPlatform:database_name:host_name:port";
$dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i
or '' =~ /()/; # ensure $1 etc are empty if match fails
print($dsn);
Outputs database_name:host_name:port.
It's clear from the comments in the code:
602 # extract dbi:driver prefix from $dsn into $1
603 $dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i
604 or '' =~ /()/; # ensure $1 etc are empty if match fails
If you have problems understanding how s// and m// work see perlop and perlre.
If a capturing match fails, $1 may still contain a value: the capture from the last successful match in the same dynamic scope, possibly from some other previous regexp. It appears the author didn't want a failed match at this point to leave some value in $1 from a previous regexp. To prevent this, he forced a "will always succeed" capturing match with nothing specified within the capturing parens. That means that there will be a match, and a capture of the empty string. In other words, $1 will now be empty rather than containing the match value from some previous successful match.
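The stale-capture behavior can be reproduced with a short sketch. The strings and variable names below are invented for the demonstration.

```perl
use strict;
use warnings;

'previous' =~ /(\w+)/;    # an earlier, unrelated match leaves "previous" in $1

my $dsn = 'not a dbi string';    # made-up input that cannot match

$dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i;    # fails, captures are untouched
my $stale = $1;                             # still holds the old capture

$dsn =~ s/^dbi:(\w*?)(?:\((.*?)\))?://i
    or '' =~ /()/;                          # fails, then the fallback matches
my $reset      = $1;                        # now the empty string
my $two_is_set = defined $2 ? 1 : 0;        # the fallback has no second group

print "stale='$stale' reset='$reset' \$2 defined: $two_is_set\n";
```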
A more common idiom is simply to test for match success before executing whatever code will rely on $1's value, as in:
if( /(match)/ ) {
say $1;
}
While that's often the simplest approach, unfortunately code sometimes is not simple, and forcing that test into some complex code may make a tricky section even harder to deal with. That being the case, it may just be easier to ensure that $1 contains nothing after a failed match, rather than what it contained before the failed match.
I actually think that's a good question. Finding documentation of the behavior of $1 after a failed match isn't easy within the Perl POD. I believe a more thorough explanation is found either in the camel book or the llama book. But I don't have them at my fingertips right now to check.
What is left out of the answers so far is the reason for that mysterious or '' =~ /()/. Without that bit of trickiness, $1 will be undefined if the match fails. The code is probably using $1 in a concatenation or a string shortly after this match. Doing this with $1 undefined will result in a "Use of uninitialized value $1 in concatenation (.) or string" warning if use warnings is in effect. With that or '' =~ /()/ trickiness in play, $1 will be defined (but empty) should the regular expression fail to match. This keeps that code that uses $1 from spewing.
The comment # ensure $1 etc are empty if match fails is incorrect. Get rid of that 'etc' and the comment is correct. This action sets $1, and $1 only. This code does not set $2. $2 will be undefined if the regular expression does not match.

What is the regular Expression to uncomment a block of Perl code in Eclipse?

I need a regular expression to uncomment a block of Perl code, commented with # in each line.
As of now, my find expression in the Eclipse IDE is (^#(.*$\R)+) which matches the commented block, but if I give $2 as the replace expression, it only prints the last matched line. How do I remove the # while replacing?
For example, I need to convert:
# print "yes";
# print "no";
# print "blah";
to
print "yes";
print "no";
print "blah";
In most flavors, when a capturing group is repeated, only the last capture is kept. Your original pattern uses + repetition to match multiple lines of comments, but group 2 can only keep what was captured in the last match from the last line. This is the source of your problem.
To fix this, you can remove the outer repetition, so you match and replace one line at a time. Perhaps the simplest pattern to do this is to match:
^#\s*
And replace with the empty string.
Since this performs match and replacement one line at a time, you must repeat it as many times as necessary (in some flavors, you can use the g global flag, in e.g. Java there are replaceFirst/All pair of methods instead).
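The same behavior can be observed in Perl, shown here as a sketch rather than as Eclipse's own engine. The sample block is the one from the question, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $block = qq{# print "yes";\n# print "no";\n# print "blah";\n};

# A repeated capturing group keeps only its final iteration.
my $last_capture;
if ( $block =~ /^(#(.*\n))+/ ) {
    $last_capture = $2;    # content of the last commented line only
}

# Line-by-line alternative: /m lets ^ match at each line start, /g repeats.
(my $uncommented = $block) =~ s/^#\s*//mg;

print $uncommented;
```

One caveat with ^#\s* under /m: \s also matches a newline, so a line consisting of a lone # would be joined with the line after it.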
References
regular-expressions.info/Repeating a Captured Group vs Capturing a Repeated Group
Related questions
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps all repeated matches
Special note on Eclipse keyboard shortcuts
In Java mode, Eclipse already has keyboard shortcuts to add/remove/toggle block comments. By default, Ctrl+/ binds to the "Toggle comment" action. You can highlight multiple lines, and hit Ctrl+/ to toggle block comments (i.e. //) on and off.
You can hit Ctrl+Shift+L to see a list of all keyboard shortcuts. There may be one in Perl mode to toggle Perl block comments #.
Related questions
What is your favorite hot-key in Eclipse?
Hidden features of Eclipse
Search with ^#(.*$) and replace with $1
You can try this one: -
use strict;
use warnings;
my $data = "#Hello\n#stack\n#overflow\n";
$data =~ s/^#//mg; # /m makes ^ match at the start of every line
print $data;
OUTPUT:-
Hello
stack
overflow
Or
open(IN, '<', "test.pl") or die $!;
read(IN, my $data, -s "test.pl"); #reading a file
$data =~ s/^#//mg; # /m makes ^ match at the start of every line
open(OUT, '>', "test1.pl") or die $!;
print OUT $data; #Writing a file
close OUT;
close IN;
Note: Take care of #!/usr/bin/perl in the Perl script, it will uncomment it also.
You need the GLOBAL g switch.
s/^#(.+)/$1/g
In order to determine whether a perl '#' is a comment or something else, you have to compile the perl and build a parse tree, because of Schwartz's Snippet
whatever / 25 ; # / ; die "this dies!";
Whether that '#' is a comment or part of a regex depends on whether whatever() is nullary, which depends on the parse tree.
For the simple cases, however, yours is failing because (^#(.*$\R)+) repeats a capturing group, which is not what you wanted.
But anyway, if you want to handle simple cases, I don't even like the regex that everyone else is using, because it fails if there is whitespace before the # on the line. What about
^\s*#(.*)$
? This will match any line that begins with a comment (optionally with whitespace, e.g., for indented blocks).
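For the simple cases, a Perl sketch along those lines might look like this. The sample text is made up, and the choices to keep the indentation, drop one space after the #, and leave #! lines alone are assumptions about what is wanted, not part of the pattern above.

```perl
use strict;
use warnings;

# Made-up script text: a shebang, an indented comment, a flush-left comment.
my $src = "#!/usr/bin/perl\n    # print 1;\n#print 2;\n";

# [ \t]* keeps indentation in $1; (?!!) skips "#!" lines; " ?" drops one space.
(my $out = $src) =~ s/^([ \t]*)#(?!!) ?/$1/mg;

print $out;
```

This is still a line-based heuristic, so it cannot tell a real comment from a # that only looks like one.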
Try this regex:
(^[\t ]*)(\#)(.*)
With this replacement:
$1$3
Group 1 is (^[\t ]*) and matches any leading whitespace (spaces and tabs), including none.
Group 2 is (\#) and matches one # character.
Group 3 is (.*) and matches the rest of the line.