Strange result on perl regexp - end string anchor & ungreedy at once

I have a very simple substitution:
my $s = "<a>test</a> <a>test</a>";
$s =~ s{ <a> .+? </a> $ }{WHAT}x;
print "$s\n";
that prints:
WHAT
But I was expecting:
<a>test</a> WHAT
What am I misunderstanding about how the end-of-string anchor interacts with the non-greedy quantifier?
Update: I was wrong about the regex engine. Don't anthropomorphize code: it does exactly what you wrote, not what you had in mind.
It simply finds the first <a>, then looks for a </a> at the end of the string. The first starting position already succeeds, so the pattern matches there.
The right pattern should be something like:
$s =~ s{ <a> (?! .* <a> ) .* </a> }{WHAT}x;
which correctly gives me
<a>test</a> WHAT
because now I am really asking the regex for the last <a>.
I think it is less efficient than [^<]+, but more flexible.

This is one of the reasons you don't use a regex to match HTML. Try using a parser instead. See this question and its answers for more reasons not to use a regex, and this question and its answers for examples of how to use an HTML parser.

The non-greedy modifier (and regexes in general) works from left to right, so in essence what happens here is that the match starts at the first <a> and then takes the shortest string from there up to a </a> that is at the end of the string.
This does what you would expect:
my $s="<a>test</a> <a>test</a>";
$s =~ s#<a>[^<>]+</a>$#WHAT#;
print "$s\n";
What is the problem you're trying to solve?
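To make the left-to-right behaviour concrete, here is a minimal sketch that runs the three substitutions from this thread side by side on the asker's string. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $s = "<a>test</a> <a>test</a>";

# Non-greedy: the match still starts at the first <a>, and .+? grows
# until a </a> at the end of the string is reached.
(my $lazy = $s) =~ s{ <a> .+? </a> $ }{WHAT}x;

# Negative lookahead: only an <a> with no later <a> may start the match.
(my $last = $s) =~ s{ <a> (?! .* <a> ) .* </a> }{WHAT}x;

# Negated class: the tag body cannot contain < or >, so it cannot span two tags.
(my $class = $s) =~ s#<a>[^<>]+</a>$#WHAT#;

print "$lazy\n$last\n$class\n";
```

The first line printed should be the surprising WHAT, and the other two the expected <a>test</a> WHAT.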

Related

How to match repetition of a word in regular expression of perl

The situation is very simple. The word "gat" may appear 0 or 1 time in a string. How can I write regex to match it?
Right now I can only use the following to do what I want. It works in my situation, though it would also match "ga", "at" etc.
$str =~ m/(g?a?t?)/
I guess there is a much easier expression to use "?" on the word "gat", but I tried "{}" and it doesn't work.
Thanks!

Use a Non-capturing Group and the ? quantifier
$str =~ m/...(?:gat)?.../
Can also be written as:
$str =~ m/...(?:gat){0,1}.../
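A small sketch of the difference between making the whole word optional and making each letter optional. The surrounding "foo" and "bar" are placeholders standing in for the "..." above, not part of the original question.

```perl
use strict;
use warnings;

# (?:gat)? makes the whole word optional; g?a?t? makes each letter optional.
my $whole_word = qr/^foo(?:gat)?bar$/;
my $per_letter = qr/^foo(?:g?a?t?)bar$/;

for my $str (qw(foobar foogatbar foogabar)) {
    printf "%-10s whole=%s letters=%s\n", $str,
        ($str =~ $whole_word ? "yes" : "no"),
        ($str =~ $per_letter ? "yes" : "no");
}
```

Only the per-letter version should accept "foogabar", which is the unwanted partial match the question mentions.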

.*?(\b(?:gat)\b)?
Try this. It will capture every gat.
http://regex101.com/r/pP3pN1/33

Lookbehind does not work as expected

I am trying to understand the lookbehind.
This example I am trying doesn't work as I expected. I wanted to try to form a regex that would match John but not John.
The following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*/) {
print "Matches!\n";
}
'
Matches!
matches up to and including . of course. The problem is the following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*(?<![.])/) {
print "Matches!\n";
}
'
Matches!
For the latter, I expected the regex to match John. (consuming the period).
Then, at the next position, it would look behind, realize that it had consumed a period (.), and reject the match.
Is my understanding wrong? What am I messing up here?
Update:
Same result also for my $var = "John. ";
Update 2:
My question is not about how to match only John and not John.
It is about understanding how lookbehind works and, if it is not supposed to work in this case, why.

The * is a quantification operator, not a placeholder. So A* means zero or more A characters. Without any further context, this always matches, e.g. "foo" =~ /J*/ is true.
What you intended to write was /J.*/ which does what you've actually described.
Now let's look what happens when we do "John." =~ /(J.*(?<![.]))/:
The regex engine sees J, which matches.
The next pattern is .*, which matches ohn..
Next the assertion (?<![.]) is tested, which fails.
The regex engine therefore backtracks.
We try .* again, but this time only match ohn.
Next the assertion (?<![.]) is tested, which succeeds.
In the above regex, I enclosed the pattern in a capture group, which we can now read out:
$ perl -E'"John." =~ /(J.*(?<![.]))/ and say "<$1>" or say "No match"'
<John>
It is often more efficient to use a character class instead of assertions and .* quantifications, so that we can avoid backtracking:
/J[^.]*/
However, this is not strictly equivalent to the above regexes.
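Here is a rough sketch of why the two are not equivalent. The input string with two periods is my own example, chosen so that the backtracking version and the character-class version stop in different places.

```perl
use strict;
use warnings;

my $var = "John.Smith.";

# J* is "zero or more J", so it matches just the leading "J".
my ($bare) = $var =~ /(J*)/;

# .* first takes everything, then backs off one character at a time
# until the character before the current position is not a period.
my ($lookbehind) = $var =~ /(J.*(?<![.]))/;

# [^.]* simply stops in front of the first period and never backtracks past it.
my ($class) = $var =~ /(J[^.]*)/;

print "<$bare> <$lookbehind> <$class>\n";
```

The lookbehind version only gives up the trailing period, while the character class stops at the first one.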

This regexp:
/John(?![.])/
will match John but not John. It uses a negative look-ahead assertion (rather than look-behind).
If you want to match full names other than 'John', you'll need to be a bit more specific about what you do and don't want to allow in the match, as putting J* will match zero or more J's.
Edit: Obviously I misread the * per @amon's post. Look-ahead vs. look-behind still applies.
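A quick sketch of the look-ahead version against a few inputs. The test strings are assumptions for illustration.

```perl
use strict;
use warnings;

my $not_before_period = qr/John(?![.])/;

my %matched;
for my $input ("John", "John.", "John Smith") {
    # The assertion looks at the character after "John" without consuming it.
    $matched{$input} = $input =~ $not_before_period ? 1 : 0;
    print "'$input': ", ($matched{$input} ? "match" : "no match"), "\n";
}
```

Only "John." should be rejected, because that is the only input where a period directly follows the name.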

Perl regex becoming greedy when used (.*?) with anchors

I have a Perl regex that wraps YouTube video links in a video tag. The YouTube links are sometimes inside anchors and sometimes not. I match the anchor's attributes with (.*?), but it behaves as if it were greedy. Below is the regex I am using.
$text =~ s#(^|\s|\>)(?:<a(.*?)\>)?((http|https)://(?:www.)?(?:youtu.be/|youtube.com(?:/embed/|/v/|/watch\?v=|/watch\?[a-z_=]+&(amp;)?v=))([\w-]{11}))[\?&\w;\=\+\-\.]*(\<\/a\>)?#$1\[video\]$3\[\/video\]#isg;
Please help to make it non-greedy.
Sample of input data:
<a rel="nofollow" href="https://www.facebook.com/photo.php?v=639296402756602" target="_blank">https://www.facebook.com/photo.php?v=639296402756602</a>
<a rel="nofollow" href="https://www.youtube.com/watch?v=9gTw2EDkaDQ" target="_blank">https://www.youtube.com/watch?v=9gTw2EDkaDQ</a>
I am expecting the output below:
<a rel="nofollow" href="https://www.facebook.com/photo.php?v=639296402756602" target="_blank">https://www.facebook.com/photo.php?v=639296402756602</a>
[video]https://www.youtube.com/watch?v=9gTw2EDkaDQ[/video]
but it returns only the YouTube link; the Facebook video link is dropped:
[video]https://www.youtube.com/watch?v=9gTw2EDkaDQ[/video]

Do you really want to match > characters? I bet you don't... So don't use .* and that will solve your greediness problem. Use [^>]* instead. It's guaranteed to stop as soon as it hits the first > (even without tacking on a ?) because > doesn't match.
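To illustrate, here is a sketch with a deliberately shortened pattern and made-up input (neither is the asker's real regex or data). It only shows how .*? can run across tag boundaries while [^>]* cannot.

```perl
use strict;
use warnings;

# Hypothetical, shortened input: one non-YouTube anchor followed by a YouTube one.
my $text = '<a href="fb">https://fb.example/x</a> <a href="yt">https://youtu.be/abc</a>';

# Lazy dot: starting at the first <a, .*? keeps expanding across tags until the
# rest of the pattern fits, so both anchors are swallowed.
(my $lazy = $text) =~ s{<a(.*?)>https://youtu\.be/\w+</a>}{[video]};

# Negated class: [^>]* cannot leave the tag it started in, so the first anchor
# fails to match and only the second one is replaced.
(my $class = $text) =~ s{<a([^>]*)>https://youtu\.be/\w+</a>}{[video]};

print "$lazy\n$class\n";
```

The lazy version is not being greedy as such. It is still the shortest match from the leftmost starting point, and that starting point happens to be the wrong anchor.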

$text =~ s#(^|\s|\>)(?:<a(.*?)\>)?((http|https)://(?:www.)?(?:youtu.be/|youtube.com(?:/embed/|/v/|/watch\?v=|/watch\?[a-z_=]+&(amp;)?v=))([\w-]{11}))[\?&\w;\=\+\-\.]*(\<\/a\>)?#$1\[video\]$3\[\/video\]#isg;
This regexp is unreadable, and no one will want to read it. Remember that regular expressions are programs too, and they need code formatting too.
Always use the `smx` modifiers with all regexps. This is very good practice, like `always use strict and warnings`.
m - Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of line only at the left and right ends of the string to matching them anywhere within the string.
s - Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
x - Extend your pattern's legibility by permitting whitespace and comments.
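A minimal sketch of what each modifier changes, using a made-up two-line string:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Without /m, ^ only matches at the very start of the string.
my $plain_anchor = $text =~ /^second/  ? 1 : 0;
my $multi_anchor = $text =~ /^second/m ? 1 : 0;

# Without /s, . refuses to match the newline between the two lines.
my $plain_dot  = $text =~ /line.second/  ? 1 : 0;
my $single_dot = $text =~ /line.second/s ? 1 : 0;

# With /x, whitespace in the pattern is ignored, so real whitespace needs \s.
my $extended = $text =~ / ^ second \s+ line $ /mx ? 1 : 0;

print "$plain_anchor $multi_anchor $plain_dot $single_dot $extended\n";
```

This should print 0 1 0 1 1.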
Then your code will be much more readable, and you will see that it contains many unused capturing groups, dead code, and small bugs, such as the unescaped `.` in the URL capturing group.
After all these modifications, and using `[^>]*` instead of `.*?` as Dave Sherohman says, your code will look much better, won't it? Check this out:
$text =~ s{
    (?:<a[^>]*>)?                # optional opening anchor tag
    (
        https?://
        (?:www[.])?
        youtu[.]?be(?:[.]com)?
        (?:/embed/|/v/|/watch\?v=|/watch\?[a-z_=]+&(?:amp;)?v=)
    )
    ([\w-]{11})                  # the 11-character video id
    [^<]*
    (?:</a>)?                    # optional closing anchor tag
}
{\[video\]$1$2\[/video\]}smxgi;
And it works fine!

Shortest match issues

I know the ? operator enables "non greedy" mode, but I am running into a problem, I can't seem to get around. Consider a string like this:
my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';
where there are opening and closing tags <a> and </a>, there are keys ABC, DEF and GHI but are surrounded by some other random text. I want to replace the <a>klashsdjDEFasl;jjf</a> with <b>TEST</b> for example. However, if I have something like this:
$str =~ s/<a>.*?DEF.*?<\/a>/<b>TEST<\/b>/;
Even with the non-greedy operators .*?, this does not do what I want. I know why: the first <a> in the pattern matches the first occurrence in the string, .*? then matches all the way up to DEF, and the rest matches up to the nearest closing </a>. What I want, however, is a way to match the opening <a> and closing </a> that are closest to "DEF". So currently, I get this as the result:
<b>TEST</b><a>askldhsfGHIasfklhss</a>
Whereas I am looking for something to get this result:
<a>sdkhfdfojABCasjklhd</a><b>TEST</b><a>askldhsfGHIasfklhss</a>
By the way, I am not trying to parse HTML here, I know there are modules to do this, I am simply asking how this could be done.
Thanks,
Eric Seifert

$str =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;
The problem is that even with non-greedy matching, Perl is still trying to find the match that starts at the leftmost possible point in the string. Since .*? can match <a> or </a>, that means it will always find the first <a> on the line.
Adding a greedy (.*) at the beginning causes it to find the last possible matching <a> on the line (because .* first grabs the whole line, and then backtracks until a match is found).
One caveat: Because it finds the rightmost match first, you can't use this technique with the /g modifier. Any additional matches would be inside $1, and /g resumes the search where the previous match ended, so it won't find them. Instead, you'd have to use a loop like:
1 while $str =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;
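A short sketch comparing the plain non-greedy substitution with the greedy-prefix version, plus the loop on a second string. The string in $multi is my own example with two DEF blocks; the other names are made up too.

```perl
use strict;
use warnings;

my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';

# Leftmost match: starts at the first <a> and runs through to the </a> after DEF.
(my $leftmost = $str) =~ s/<a>.*?DEF.*?<\/a>/<b>TEST<\/b>/;

# Greedy prefix: (.*) pushes the start of <a> as far right as it can go.
(my $prefixed = $str) =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;

# Looping instead of /g: each pass replaces the rightmost remaining block.
my $multi = '<a>xDEFx</a><a>yDEFy</a>';
1 while $multi =~ s/(.*)<a>.*?DEF.*?<\/a>/$1<b>TEST<\/b>/;

print "$leftmost\n$prefixed\n$multi\n";
```

The loop ends here because the replacement text contains no <a>, so each pass leaves one fewer candidate.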

Instead of a dot which says: "match any character", use what you really need which says: "match any char that is not the start of </a>". This translates into something like this:
$str =~ s/<a>(?:(?!<\/a>).)*DEF(?:(?!<\/a>).)*<\/a>/<b>TEST<\/b>/;

#!/usr/bin/perl
use warnings;
use strict;
my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';
my @collections = $str =~ /<a>.*?(ABC|DEF|GHI).*?<\/a>/g;
print join ", ", @collections;

s{
<a>
(?: (?! </a> ) . )*
DEF
(?: (?! </a> ) . )*
</a>
}{<b>TEST</b>}x;
Basically,
(?: (?! PAT ) . )
is the equivalent of
[^CHARS]
for regex patterns instead of characters.
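As a sketch of that equivalence, the following runs the (?: (?! </a> ) . )* form next to a plain negated class on the string from the question. The [^<]* variant is only a stand-in that happens to work because this data has no other < inside the tags.

```perl
use strict;
use warnings;

my $str = '<a>sdkhfdfojABCasjklhd</a><a>klashsdjDEFasl;jjf</a><a>askldhsfGHIasfklhss</a>';

# "Any character, as long as </a> does not start here": the pattern
# cannot run past a closing tag on its way to DEF.
(my $tempered = $str) =~ s{
    <a>
    (?: (?! </a> ) . )*
    DEF
    (?: (?! </a> ) . )*
    </a>
}{<b>TEST</b>}x;

# Single-character analogue: "any character that is not <".
(my $negated = $str) =~ s{<a>[^<]*DEF[^<]*</a>}{<b>TEST</b>};

print "$tempered\n$negated\n";
```

Both should replace only the middle block. The lookahead form is the one to reach for when the thing to exclude is longer than one character.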

Based on my understanding, this is what you are looking for.
Using the lazy quantifier ? without the global flag is the answer.
E.g.,
If you had the global flag /g, it would have matched all of the shortest matches, as below.

How can I preserve whitespace when I match and replace several words in Perl?

Let's say I have some original text:
here is some text that has a substring that I'm interested in embedded in it.
I need the text to match a part of it, say: "has a substring".
However, the original text and the matching string may have whitespace differences. For example the match text might be:
has a
substring
or
has a substring
and/or the original text might be:
here is some
text that has
a substring that I'm interested in embedded in it.
What I need my program to output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
I also need to preserve the whitespace pattern in the original and just add the start and end markers to it.
Any ideas about a way of using Perl regexes to get this to happen? I tried, but ended up getting horribly confused.

Been some time since I've used perl regular expressions, but what about:
$match =~ s/(has\s+a\s+substring)/[$1]/ig;
This matches one or more whitespace characters (including newlines) between the words. It wraps the entire match in brackets while maintaining the original separation. It ain't automatic, but it does work.
You could play games with this, like taking the string "has a substring" and transforming it into "has\s+a\s+substring" to make this a little less painful.
EDIT: Incorporated ysth's comment that the \s metacharacter matches newlines, and hobbs's corrections to my \s usage.

This pattern will match the string that you're looking to find:
(has\s+a\s+substring)
So, when the user enters a search string, replace any whitespace in the search string with \s+ and you have your pattern. Then just replace every match with [match starts here]$1[match ends here], where $1 is the matched text.

In regexes, you can use + to mean "one or more." So something like this
/has\s+a\s+substring/
matches has followed by one or more whitespace chars, followed by a followed by one or more whitespace chars, followed by substring.
Putting it together with a substitution operator, you can say:
my $str = "here is some text that has a substring that I'm interested in embedded in it.";
$str =~ s/(has\s+a\s+substring)/\[match starts here]$1\[match ends here]/gs;
print $str;
And the output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.

As many have suggested, use \s+ to match whitespace. Here is how you do it automatically:
my $original = "here is some text that has a substring that I'm interested in embedded in it.";
my $search = "has a\nsubstring";
my $re = $search;
$re =~ s/\s+/\\s+/g;
$original =~ s/\b$re\b/[match starts here]$&\[match ends here]/g;
print $original;
Output:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
You might want to escape any meta-characters in the string. If someone is interested, I could add it.
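For anyone interested, here is one way the escaping might look: a sketch, not a definitive version. It splits the search phrase on whitespace, runs each word through quotemeta, and joins the pieces with \s+. The helper names phrase_to_regex and mark_matches are made up for this example.

```perl
use strict;
use warnings;

# Build a whitespace-tolerant pattern from a literal search phrase.
sub phrase_to_regex {
    my ($phrase) = @_;
    my @words = split ' ', $phrase;                   # split on any run of whitespace
    my $body  = join '\s+', map { quotemeta } @words; # escape metacharacters per word
    return qr/$body/;
}

# Wrap every match in markers while keeping the original whitespace.
sub mark_matches {
    my ($text, $phrase) = @_;
    my $re = phrase_to_regex($phrase);
    my ($open, $close) = ('[match starts here]', '[match ends here]');
    $text =~ s/($re)/$open$1$close/g;
    return $text;
}

print mark_matches("some\ntext that has\na substring that I want", "has a\nsubstring"), "\n";
```

Keeping the markers in variables avoids writing $1[ in the replacement, which Perl would try to read as an array subscript.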

This is an example of how you could do that.
#! /opt/perl/bin/perl
use strict;
use warnings;
my $submatch = "has a\nsubstring";
my $str = "
here is some
text that has
a substring that I'm interested in, embedded in it.
";
print substr_match($str, $submatch), "\n";
sub substr_match{
    my($string,$match) = @_;
    $match =~ s/\s+/\\s+/g;
    # This isn't safe the way it is now, you will need to sanitize $match
    $string =~ /\b$match\b/;
}
This currently does not do anything to check the $match variable for unsafe characters.
