What matches this regex in Perl: qr!$line!;

In some source code I found this regex:
my $var = qr!$my_string!;
I simply can't figure out what this matches. I also searched online, and the explanation of qr is quite simple.
Can someone explain it in a human language, please? :)

Most quote-like operators can take nearly anything for delimiters.† The qr operator is no exception, here using ! as the delimiter. It also evaluates ("interpolates") variables inside the quoted pattern. So that line of code builds a regex pattern from whatever is in the $my_string variable. This pattern is presumably used later in the code for matching or substitution. This is normal use.
A complete example:
use warnings;
use strict;
use feature 'say';
my $str = q(hey);
my $re = qr!$str!;
my $tgt = q(A_hey_C);
$tgt =~ s/$re/B/;
say $tgt; #--> A_B_C
The purpose of qr is specifically to construct a regex pattern so one would expect that the $str above ($my_string in the question), or the pattern in qr, would contain regex-specific patterns, perhaps along with other variables assembled in the program. The $str above with just a plain string can nicely be used directly in a regex, so this isn't a realistic example.
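A somewhat more realistic sketch, where the pieces passed to qr carry actual regex syntax and are assembled into a larger pattern. The names $year, $month, $date and the sample text are made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

# Hypothetical sub-patterns, each compiled once with qr.
my $year  = qr/\d{4}/;
my $month = qr/0[1-9]|1[0-2]/;

# qr objects interpolate into a larger pattern and keep their own grouping,
# so the alternation in $month stays contained. Any delimiter works; ! is used here.
my $date = qr!($year)-($month)!;

my $line = 'released on 2011-09 and patched later';
my ($y, $m) = $line =~ $date;

print "year=$y month=$m\n";   # year=2011 month=09
```

If the variable is meant to match as plain text rather than as a pattern, wrapping it as \Q$str\E inside qr escapes any regex metacharacters it contains.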
† See
What are the legal delimiters for Perl 5's pick-your-own-quotes operators?

Related

perl regular expressions substitute

I'm new to Perl and I found this expression a bit difficult to understand:
$a =~ s/\'/\'\"\'\"\'/g
Can someone help me understand what this piece of code does?
It is not clear which part of it is a problem. Here is a brief mention of each.
The expression $str =~ s/pattern/replacement/ is a substitution: it replaces the part of the string $str that is matched by the "pattern" with the "replacement". With /g at the end it does so for all instances of the "pattern" found in the string. There is far, far more to it -- see the tutorial, a quick start, the reference, and full information in perlre and perlop.
Some characters have a special meaning, in particular when used in the "pattern". A common set is .^$*+?()[{\|, but this depends on the flavor of regex. Also, whatever is used as a delimiter (commonly /) has to be escaped as well. See documentation and for example this post. If you want to use any one of those characters as a literal character to be found in the string, they have to be escaped by the backslash. (Also see about escaped sequences.) In your example, the ' is escaped, and so are the characters used as the replacement.
The thing is, these in fact don't need to be escaped. So this works
use strict;
use warnings;
my $str = "a'b'c";
$str =~ s/'/'"'/g;
print "$str\n";
It prints the desired a'"'b'"'c. Note that you still may escape them and it will work. One situation where ' is indeed very special is in a one liner, where it delimits the whole piece of code to submit to Perl via shell, to be run. (However, merely escaping it there does not work.) Altogether, the expression you show does the replacement just as in the answer by Jens, but with a twist that is not needed.
A note on a related feature: you can use nearly anything for delimiters, not necessarily just /. So this |patt|repl| is fine, and nicely even this is: {patt}{repl} (unless they are used inside). This is sometimes extremely handy, improving readability a lot. For one thing, then you don't have to escape the / itself inside the regex.
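A small sketch of that point, with a made-up path as input. The same substitution is written with three different delimiters, and only the first needs the slashes escaped.

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';   # hypothetical input

# Same substitution three ways; only the delimiters differ.
(my $slashes = $path) =~ s/\/usr\/local/\/opt/;
(my $pipes   = $path) =~ s|/usr/local|/opt|;
(my $braces  = $path) =~ s{/usr/local}{/opt};

print "$slashes\n$pipes\n$braces\n";   # /opt/bin/perl, three times
```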
I hope this helps a little.
All the \ in that are useless (but harmless), so it is equivalent to s/'/'"'"'/g. It replaces every ' with '"'"' in the string.
This is often used for shell quoting. See for example https://stackoverflow.com/a/24869016/17389
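As a rough illustration of the shell-quoting use, here is a minimal sketch. The helper name shell_quote is hypothetical, and it assumes a POSIX-style shell where a single-quoted string cannot contain a single quote.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a string in single quotes for a POSIX shell.
sub shell_quote {
    my ($s) = @_;
    # Close the single-quoted string, add a double-quoted ', then reopen it.
    $s =~ s/'/'"'"'/g;
    return "'$s'";
}

my $quoted = shell_quote("it's here");
print "$quoted\n";   # 'it'"'"'s here'
```

The shell reads the result as three adjacent pieces ('it', "'", 's here') and joins them back into the original string.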
This piece of code replaces every single quote with '"'"'
It can be simplified by removing the needless backslashes:
$a =~ s/'/'"'"'/g
Every ' will be replaced with '"'"'

What is the meaning of $testModule =~ s#/#::#ig; in Perl?

I am a Perl newbie reading some Perl code, and I found the line below that I couldn't understand. Can anyone tell me the meaning of
s#/#::#ig
I know =~ tries to match a regular expression. Usually I see code like s/<regular expression>//gi, so I was a little confused by the following code. Can anyone elaborate?
$testModule =~ s#/#::#ig;
You can use lots of different characters as regex delimiters.
This one is using # instead of / so it can use / as data inside the regex without escaping it.
It's equivalent to:
$testModule =~ s/\//::/ig;
See quote and quote-like operators in the Perl documentation for more details.
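A quick way to convince yourself the two forms are the same, using a made-up value for $testModule. The /i flag is left out here because the pattern contains no letters, so it has nothing to affect.

```perl
use strict;
use warnings;

my $testModule = 'My/Test/Module';   # hypothetical value

# Same substitution with two different delimiters.
(my $with_hash  = $testModule) =~ s#/#::#g;
(my $with_slash = $testModule) =~ s/\//::/g;

print "$with_hash\n";    # My::Test::Module
print "$with_slash\n";   # My::Test::Module
```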

Backtick Character ` in Perl Regex Meaning

I have inherited a Perl codebase that uses regex to parse an XML file. Not optimal, I know. I have lines of code like:
$title =~ s`<(.*?)>``g;
and
$content =~ s`__DOUBLEFIG__`${fig_html}`;
The standard Perl find and replace takes the form of
s/foo/bar/g;
What does
s`foo`bar`g;
do?
Those are alternative ways of delimiting the regex. Whoever wrote that code chose to use something else for personal reasons or, possibly, for code readability. Often, if patterns are going to involve /, something else will be chosen to avoid having to escape the slash character in the regex.
This answer provides more information.
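For a concrete check, the sketch below runs the tag-stripping line from the question next to its slash-delimited form on a made-up $title value.

```perl
use strict;
use warnings;

my $title = '<b>Hello</b> <i>world</i>';   # hypothetical input

# Backticks here are only delimiters; they do not run a shell command.
(my $with_backticks = $title) =~ s`<(.*?)>``g;
(my $with_slashes   = $title) =~ s/<(.*?)>//g;

print "$with_backticks\n";   # Hello world
print "$with_slashes\n";     # Hello world
```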

regex taking a long time

I have the following script, which grabs a webpage, then does a regex to find items I'm looking for:
use warnings;
use strict;
use LWP::Simple;
my $content=get('http://mytempscripts.com/2011/09/temporary-post.html') or die $!;
$content=~s/\n//g;
$content=~s/ / /g;
$content=~/<b>this is a temp post<\/b><br \/><br \/>(.*?)<div style='clear: both;'><\/div>/;
my $temp=$1;
while($temp=~/((.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9] {1,})(.*?)\s+)/g){
print "found a match\n";
}
This works, but takes a long, long time. When I shorten the regex to the following, I get the results in less than a second. Why does my original regex take so long? How do I correct it?
while($temp=~/((.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9] {1,})(.*?)\s+)/g){
print "found a match\n";
}
Regular expressions are like the sort function in Perl. You think it's pretty simple because it's just a single command, but in the end, it uses a lot of processing power to do the job.
There are certain things you can do to help out:
Keep your syntax as simple as possible.
Precompile your regular expression pattern by using qr// if you're using that regular expression in a loop. That'll prevent Perl from having to compile your regular expression with each loop.
Try to avoid regular expression syntax that has to do backtracking. This usually ends up being the most general matching patterns (such as .*).
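A minimal sketch of the last two suggestions, on made-up data rather than the page from the question. The pattern is built once with qr outside the loop, and [^<]* stands in for .* so the match cannot wander past the next tag.

```perl
use strict;
use warnings;

# Compiled once, outside the loop. [^<]* stops at the next tag,
# so it leaves the engine far fewer ways to backtrack than .* would.
my $bold = qr{<b>([^<]*)</b>};

# Hypothetical input lines.
my @lines = ('<b>one</b>', 'plain', '<b>two</b> and <b>three</b>');

my @found;
for my $line (@lines) {
    while ($line =~ /$bold/g) {
        push @found, $1;
    }
}

print "@found\n";   # one two three
```

Whether this helps in the real script depends on what the data looks like, so treat it as a starting point to measure, not a guaranteed fix.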
The wretched truth is that after decades of writing in Perl, I've never mastered the deep dark secrets of regular expression parsing. I've tried many times to understand it, but that usually means doing research on the Web, and ...well... I get distracted by all of the other stuff on the Web.
And, it's not that difficult, any half decent developer with an IQ of 240, and a penchant for sadism should easily be able to pick it up.
@David W.: I guess I'm confused on backtracking. I had to read your link several times but still don't quite understand how to implement it (or, not implement it) in my case. – user522962
Let's take a simple example:
my $string = 'foobarfubar';
$string =~ /foo.*bar.*(.+)/;
my $result = $1;
What will $result be? It will be r. You see how that works? Let's see what happens.
Originally, the regular expression is broken into tokens, and the first token foo.* is used. That actually matches the whole string:
"foobarfubar" =~ /foo.*/
However, if the first regular expression token captures the whole string, the rest of the regular expression fails. Therefore, the regular expression matching algorithm has to backtrack, giving up one character at a time:
"foobarfubar" =~ /foo.*/ #/bar.*/ doesn't match
"foobarfuba" =~ /foo.*/ #/bar.*/ doesn't match.
"foobarfub" =~ /foo.*/ #/bar.*/ doesn't match.
"foobarfu" =~ /foo.*/ #/bar.*/ doesn't match.
"foobarf" =~ /foo.*/ #/bar.*/ doesn't match.
"foobar" =~ /foo.*/ #/bar.*/ doesn't match.
...
"foo" =~ /foo.*/ #Now /bar.*/ can match!
Now, the same happens for the rest of the string:
"foobarfubar" =~ /foo.*bar.*/ #But the final /.+/ doesn't match
"foobarfuba" =~ /foo.*bar.*/ #And the final /.+/ can match the "r"!
Backtracking tends to happen with the .* and .+ expression since they're so loose. I see you're using non-greedy matches which can help, but it can still be an issue if you are not careful -- especially if you have very long and complex regular expressions.
I hope this helps explain backtracking.
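To see where the engine ends up after all that backtracking, one option is to put capture groups around each .* in the same example. This is only a sketch for inspection; the extra parentheses are my addition.

```perl
use strict;
use warnings;

my $string = 'foobarfubar';

# Capture each .* to see what it holds once the whole pattern has matched.
my ($first, $second, $last) = $string =~ /foo(.*)bar(.*)(.+)/;

print "first='$first' second='$second' last='$last'\n";
# first='' second='fuba' last='r'
```

The first .* starts out holding everything after foo and has to give it all back before the rest of the pattern can match. The second .* then gives up one character so that .+ has something to match.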
The issue you're running into isn't that your program doesn't work, but that it takes a long, long time.
I was hoping that the general gist of my answer is that regular expression parsing isn't as simple as Perl makes it out to be. I can see the command sort @foo; in a program, but forget that if @foo contains a million or so entries, it might take a while. In theory, Perl could be using a bubble sort, and thus the algorithm would be O(n^2). I hope that Perl is actually using a more efficient algorithm and my actual time will be closer to O(n log n). However, all this is hidden by my simple one line statement.
I don't know if backtracking is an issue in your case, but you're treating an entire webpage output as a single string to match against a regular expression, which could result in a very long string. You then match it against another regular expression over and over again. Apparently, that is quite a process-intensive step, which is hidden by the fact that it's a single Perl statement (much like sort @foo hides its complexity).
Thinking about this on and off over the weekend, you really should not attempt to parse HTML or XML with regular expressions because it is so sloppy. You end up with something rather inefficient and fragile.
In cases like this you may be better off using something like HTML::Parser, or XML::Simple, which I'm more familiar with but which doesn't necessarily work with poorly formatted HTML.
Perl regular expressions are nice, but they can easily get out of our control.
One thing you might try is changing all the capture groups (...) to non-capture groups (?:...)
That will save some effort for the matcher if all you need to do is print out "found a match", but I'm not sure you can do that in reality if your real code does more.
Also, just generally speaking having a lot of wildcards like (.*?) is just going to add weight I think, so maybe knowing what you are trying to match you will be able to eliminate some of those? I can't say for sure; don't see any purely formal optimizations here.
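A small sketch of the non-capturing idea, on a made-up string and a much simpler pattern than the one in the question. With (?:...) the groups still group, but nothing is saved into $1, $2 and so on.

```perl
use strict;
use warnings;

my $text = 'x1 22 y 333 ';   # hypothetical input

# Non-capturing groups: optional letters, then digits, then whitespace.
# In list context with /g and no captures, each whole match is returned,
# so the empty-list assignment counts the matches.
my $hits = () = $text =~ /(?:[a-z]*)(?:[0-9]{1,})\s+/g;

print "found $hits matches\n";   # found 3 matches
```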

Regex to replace all occurrences of a given character, ONLY after a given match

For the sake of simplicity, let's say that we have input strings with this format:
*text1*|*text2*
So, I want to leave text1 alone, and remove all spaces in text2.
This could be easy if we didn't have text1, a simple search and replace like this one would do:
%s/\s//g
but in this context I don't know what to do.
I tried with something like:
%s/\(.*|\S*\).\(.*\)/\1\2/g
which works, but removes only the first space; that is, it would have to be run on the same line once for each offending space.
So, a preferred restriction, is to solve this with only one search and replace. And, although I used Vim syntax, use the regular expression flavor you're most comfortable with to answer, I mean, maybe you need some functionality only offered by Perl.
Edit:
My solution for Vim:
%s:\(|.*\)\@<=\s::g
One way, in perl:
s/(^.*\||(?=\s))\s*/$1/g
Certainly much greater efficiency is possible if you allow more than just one search and replace.
So you have a string with one pipe (|) in it, and you want to replace only those spaces that don't precede the pipe?
s/\s+(?![^|]*\|)//g
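A quick check of the two Perl one-liners above on a made-up line. Both should leave the part before the pipe alone and strip the spaces after it.

```perl
use strict;
use warnings;

my $line = 'keep these spaces|remove these spaces';   # hypothetical input

# First approach: consume everything up to the pipe once, then match bare whitespace.
(my $via_capture = $line) =~ s/(^.*\||(?=\s))\s*/$1/g;

# Second approach: remove whitespace only when no pipe follows it.
(my $via_lookahead = $line) =~ s/\s+(?![^|]*\|)//g;

print "$via_capture\n";     # keep these spaces|removethesespaces
print "$via_lookahead\n";   # keep these spaces|removethesespaces
```

Note that both assume there is a pipe on the line. Without one, either substitution removes every space.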
You might try embedding Perl code in a regular expression (using the (?{...}) syntax), which is, however, rather an experimental feature and might not work or even be available in your scenario.
This
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s:\s::g })/$1$x/
should theoretically work, but I got an "Out of memory!" failure, which can be fixed by replacing '\s' with a space:
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s: ::g })/$1$x/