Perl regex becoming greedy when used (.*?) with anchors

Perl regex becoming greedy when used (.*?) with anchors - regex

I have a perl regex to add youtube video link to video tag. The YouTube videos link can be within anchors sometimes and sometimes without anchors. I have checked anchor with any value using (.*?) but it behaving as greedy. below is the regex that I am using.
$text =~ s#(^|\s|\>)(?:<a(.*?)\>)?((http|https)://(?:www.)?(?:youtu.be/|youtube.com(?:/embed/|/v/|/watch\?v=|/watch\?[a-z_=]+&(amp;)?v=))([\w-]{11}))[\?&\w;\=\+\-\.]*(\<\/a\>)?#$1\[video\]$3\[\/video\]#isg;
Please help to make it non-greedy.
Sample of input data:
<a rel="nofollow" href="https://www.facebook.com/photo.php?v=639296402756602" target="_blank">https://www.facebook.com/photo.php?v=639296402756602</a>
<a rel="nofollow" href="https://www.youtube.com/watch?v=9gTw2EDkaDQ" target="_blank">https://www.youtube.com/watch?v=9gTw2EDkaDQ</a>
I am expecting below ouput:
<a rel="nofollow" href="https://www.facebook.com/photo.php?v=639296402756602" target="_blank">https://www.facebook.com/photo.php?v=639296402756602</a>
[video]https://www.youtube.com/watch?v=9gTw2EDkaDQ[/video]
but it returns only youtube link. it is ignoring facebook video link.
[video]https://www.youtube.com/watch?v=9gTw2EDkaDQ[/video]

Do you really want to match > characters? I bet you don't... So don't use .* and that will solve your greediness problem. Use [^>]* instead. It's guaranteed to stop as soon as it hits the first > (even without tacking on a ?) because > doesn't match.

$text =~ s#(^|\s|\>)(?:<a(.*?)\>)?((http|https)://(?:www.)?(?:youtu.be/|youtube.com(?:/embed/|/v/|/watch\?v=|/watch\?[a-z_=]+&(amp;)?v=))([\w-]{11}))[\?&\w;\=\+\-\.]*(\<\/a\>)?#$1\[video\]$3\[\/video\]#isg;This regexp is unreadable and no one will want to read it. Remember, that regular expressions are programms too, and they need code formatting too.
Always use `smx` modifiers with all regexps, this is very good practice, like `always use strict and warnings`.
m - Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of line only at the left and right ends of the string to matching them anywhere within the string.
s - Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
x - Extend your pattern's legibility by permitting whitespace and comments.
Then your code will look much more readable and you will see that it contains many unusable capturing groups, and dead code, and small bugs, like using of unescaped `.` in url capturing group.
After all modifications and as Dave Sherohman says using `[^>]*` instead of `.*?` your code will look much better, isn't it?. Check this out:
$text =~ s{
(?:<a[^>]*>)?
(
http[s]?://
(?:www[.])?
youtu[.]?be(?:[.]com)?
(?:/embed/|/v/|/watch\?v=|/watch\?[a-z_=]+&(?:amp;)?v=)
)
([\w-]{11})
[^<]*
(?:</a>)?
}
{
\[video\]$1$2\[/video\]
}smxgi;
And it works fine!

Related

Regular expression that simply returns the input string

I realize that this is sort of a silly question, but I need a regular expression that will simply return whatever is passed to it.
So, for instance, if the input string is ABC123, I want the regular expression to return ABC123.
The reason for this is that I need a "catch all" situation. I am storing a set of regular expressions in a database (one or more for each customer); however some customers don't have a need for any input string parsing...I just want to use what they are passing in as the matched text.

[\s\S]*
This will catch ALL characters.

You could search for white space or not any whitespace groups repeated.
For instance:
[\s\S]*
or
[\s.]*
See the bottom of
http://www.regular-expressions.info/dot.html

Since no one is saying it, I will go ahead. Keep it simple and use .*. Depending on your regular expression flavor, . may not match new-lines. However, there should be a modifier for that. For instance, in PHP the s modifier makes dots match new lines. In Ruby, this behavior is on by default.
<?hp
preg_match('/.*/s', "ABC\n123", $matches);
var_dump($matches);
// array(1) { [0]=> string(7) "ABC\n123" }
You can also see other peoples' examples of using a character class like [\s\S]*. But I'm assuming a character class will take (minimally) longer to parse than a ..
(?s) for "single line mode" makes the dot match all characters, including line breaks. Not supported by Ruby or JavaScript. In Tcl, (?s) also makes the caret and dollar match at the start and end of the string only.

very magic regex that will allow me to replace these images?

I'm using vim daily for manipulating text and to write code. However, every time I have to perform any substitution, or do any kind of regex work, it drives me crazy, and I have to switch to sublime. I'd like to know, what's the correct way of turning this:
<img src="whatever.png"/>
<img src="x.png"/>
into
<img src="<%= image_path("whatever.png") %>"/>
<img src="<%= image_path("x.png") %>"/>
In sublime, I can use this as the regex for search: src="(.*?.png)" and this as the regex for substitution: src="<%= asset_path("\1") %>". In vim, if I do this: :%s/\vsrc="(.*?.png)/src="<%= asset_path("\1") %>"/g I get:
E62: Nested ?
E476: Invalid command
What am I not doing right?

As #nhahtdh stated Vim's dialect of regex uses \{-} as the non-greedy quantifier. If you use the very magic flag it is just {-}. So your command turns into:
:%s/\vsrc="(.{-}.png)/src="<%= asset_path("\1") %>"/g
However you didn't escape the . in .png so:
:%s/\vsrc="(.{-}\.png)/src="<%= asset_path("\1") %>"/g
But we can still do better! By using \zs and \ze we can avoid retyping the src=" bit. \zs and \ze mark the start and end of the match where the substitution will occur.
:%s/\vsrc="\zs(.\{-}\.png)"/<%= image_path("\1") %>"/g
However we still are not done because we can take it one step further if we carefully choose where we put \zs and \ze then we can use vim's & in the substitution. It is like \0 in Perl's regex syntax. Now we don't need any capture groups which nullifies the need for the very magic flag.
:%s/src="\zs.\{-}\.png\ze"/<%= image_path("&") %>/g
For more help see the following documentation:
:h /\zs
:h /\{-
:h s/\&

According to this website, the syntax for lazy quantifier in vim is different from the syntax used in Perl-like regex.
Let me quote the website:
*/\{-*
\{-n,m} matches n to m of the preceding atom, as few as possible
\{-n} matches n of the preceding atom
\{-n,} matches at least n of the preceding atom, as few as possible
\{-,m} matches 0 to m of the preceding atom, as few as possible
\{-} matches 0 or more of the preceding atom, as few as possible
{Vi does not have any of these}
n and m are positive decimal numbers or zero
*non-greedy*
If a "-" appears immediately after the "{", then a shortest match
first algorithm is used (see example below). In particular, "\{-}" is
the same as "*" but uses the shortest match first algorithm.

:%s/"\(.*\)"/"<%= image_path("\1") %>"/g
The double quotes are out main pattern. Everything we want to capture gets thrown into a group \( \) so we can later relate to it via \1.
If you use very magic, you have to escape the =, thus \vsrc\=(.*).png". So using your way the answer is:
:%s/\vsrc\="(.*\.png)"/src="<%= image_path("\1") %>"/g
It's easy to see if you :set hlsearch and then play around with /. :)

Regex detect if a matched comma(,) does not lie in a regex

I am trying to figure out a way to determine if my matched comma(,) does not lie inside a regex. Basically, i do not want to match my character if it lies in a regex.
The regex i have come up with is ,(?<!.+\/)(?!.+\/) but its not quite working.
Any ideas?
I want to skip /some,regex/ but match any other commas.
Edit:
Live example: http://rubular.com/r/WjrwSnmzyP

Here is the regex that will work for you:
,(?!\s)(?=(?:(?:[^/]*\/){2})*[^/]*$)
Live Demo: http://rubular.com/r/37buDdg1tW
Explanation: It means match comma followed by EVEN number of forward slash /. Hence comma (,) between 2 slash (/) characters will NOT be matched and outside ones will be matched (since those are followed by even number of / characters).

A curious thing about regular expressions is that if you want to use them to ignore "something" that is within "something else", you need to match that "something else", prefer matches of it, and then either silently discard or reproduce those matches.
For example, in order to remove all commas from a string unless they are in a regular expression literal—
In Perl:
my $s = "/foo,bar/,baz";
$s =~ s{(/(?:[^/\\]|\\.)+/)|,}{\1}g;
In ECMAScript:
var s = "/foo,bar/,baz";
s = s.replace(/(\/([^\/\\]|\\.)+\/)|,/g, "$1");
or
s = s.replace(new RegExp("(/([^/\\\\]|\\\\.)+/)|,", "g"), "$1");
Note that I am capturing the match for the regular expression literal in the string value, and reproducing it (\1 or $1) if it matched. (If the other part of the alternation – the standalone comma – matched, the empty string is captured, so this simple approach suffices here.)
For further reading I recommend “Mastering Regular Expressions” by Jeffrey E. F. Friedl. Two rather enlightening example chapters, each from a different edition, are available for free online.

What REGEX pattern should I use to look for a specific string pattern and remove anything else that doesnt match?

I'm parsing through code using a Perl-REGEX parsing engine in my IDE and I want to grab any variables that look like
$hash->{ hash_key04}
and nuke the rest of the code..
So far my very basic REGEX doesnt do what I expected
(.*)(\$hash\-\>\{[\w\s]+\})(.*)
(
\$
hash
\-\>
\{
[\w\s]+
\}
)
I know to use replace for this ($1,$2,etc), but match (.*) before and after the target string doesnt seem to capture all the rest of the code!
UPADTED:
tried matching null but of course thats too greedy.
([^\0]*)
What expression in regex should i use to look only for the string pattern and remove the rest?
The problem is I want to be left with the list of $hash->{} strings after the replace runs in the IDE.

This is better approached from the other direction. Instead of trying to delete everything you don't want, what about extracting everything you do want?
my #vars = $src_text =~ /(\$hash->\{[\w\s]+\})/g;
Breaking down the regex:
/( # start of capture group
\$hash-> # prefix string with $ escaped
\{ # opening escaped delimiter
[\w\s]+ # any word characters or space
\} # closing escaped delimiter
)/g; # match repeatedly returning a list of captures
Here is another way that might fit within your IDE better:
s/(\$hash->\{[\w\s]+\})|./$1/gs;
This regex tries to match one of your hash variables at each location, and if it fails, it deletes the next character and then tries again, which after running over the whole file will have deleted everything you don't want.

Depends on your coding language. What you want is group 2 (The second set of characters in parenthesis). In perl that would be $2, in VIM it would be \2, etc ...

It depends on the platform, but generally, replace the pattern with an empty string.
In javascript,
// prints "the la in ing"
console.log('the latest in testing'.replace(/test/g, ''));
In bash
$ echo 'the latest in testing' | sed 's/test//g'
the la in ing
In C#
Console.WriteLine(Regex.Replace("the latest in testing", "test", ""));
etc

By default the wildcard . won't match newlines. You can enable newlines in its matching set using a flag depending on what regex standard you're using and under what language/api. Or you can add them explicitly yourself by defining a character set:
[.\n\r]* <- Matches any character including newline, carriage return.
Combine this with capture groups to grab desired variables from your code and skip over lines which contain no capture group.
If you want help constructing the proper regex for your context you'll need to paste some input text and specify what the output should be.

I think you want to add a ^ to the beginning of the regex s/^.(PATTERN)(.)$/$1/ so that it starts at the beginning of the line and goes to the end, removing anything except that pattern.

RegEx: Grabbing values between quotation marks

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?

In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print re.findall(r'"(.*?)"', string)
['Foo Bar', 'Another Value']

I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
([""']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was use for opening.

I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.

Lets see two efficient ways that deal with escaped quotes. These patterns are not designed to be concise nor aesthetic, but to be efficient.
These ways use the first character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to discard quickly characters that are not quotes without to test the two branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously to deal with strings that haven't balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can choose too that a quoted part can be an opening quote until the next (non-escaped) quote or the end of the string. In this case there is no need to use possessive quantifiers, you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation but with ["'] at the beginning, in factor.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'

Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind is looking behind the current character to check for a quote, if found then start from there and then the lookahead is checking the character ahead for a quote and if found stop on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start, this is then used at the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.

The RegEx of accepted answer returns the values including their sourrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
-
All support escaped and nested quotes.

I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string form $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.

A very late answer, but like to answer
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1

The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job but I am concerned of its performances (it's not bad but could be better). Mine below it's ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.

This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/

MORE ANSWERS! Here is the solution i used
\"([^\"]*?icon[^\"]*?)\"
TLDR;
replace the word icon with what your looking for in said quotes and voila!
The way this works is it looks for the keyword and doesn't care what else in between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters thats not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "

I liked Axeman's more expansive version, but had some trouble with it (it didn't match for example
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1

string = "\" foo bar\" \"loloo\""
print re.findall(r'"(.*?)"',string)
just try this out , works like a charm !!!
\ indicates skip character

My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedy approach to continue matching everything zero or more times until it encounters ' or " at end of the string. After encountering such state, regex engine backtrack to previous matching character and here regex is over and will move to next regex.
\1 -> Matches to the character or string that have been matched earlier with the first capture group.
(?![^\s]) -> Negative lookahead to ensure there should not any non space character after the previous match

Unlike Adam's answer, I have a simple but worked one:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.

All the answer above are good.... except they DOES NOT support all the unicode characters! at ECMA Script (Javascript)
If you are a Node users, you might want the the modified version of accepted answer that support all unicode characters :
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.

echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity, also using the non-greedy version with this sed command we first throw out the junk before and after that ""'s and then replace this with the part between the ""'s and surround this by ><'s.

From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match, no partial matching could should trigger a hit
e.g. "test" could not match for "test2".
reg = r"""(['"])(%s)\1"""
if re.search(reg%(needle), haystack, re.IGNORECASE):
print "winning..."
Hunter

If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".

A supplementary answer for the subset of Microsoft VBA coders only one uses the library Microsoft VBScript Regular Expressions 5.5 and this gives the following code
Sub TestRegularExpression()
Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
Set oRE = New VBScript_RegExp_55.RegExp
oRE.Pattern = """([^""]*)"""
oRE.Global = True
Dim sTest As String
sTest = """Foo Bar"" ""Another Value"" something else"
Debug.Assert oRE.test(sTest)
Dim oMatchCol As VBScript_RegExp_55.MatchCollection
Set oMatchCol = oRE.Execute(sTest)
Debug.Assert oMatchCol.Count = 2
Dim oMatch As Match
For Each oMatch In oMatchCol
Debug.Print oMatch.SubMatches(0)
Next oMatch
End Sub

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Perl regex becoming greedy when used (.*?) with anchors - regex

Do you really want to match > characters? I bet you don't... So don't use .* and that will solve your greediness problem. Use [^>]* instead. It's guaranteed to stop as soon as it hits the first > (even without tacking on a ?) because > doesn't match.

Related

Regular expression that simply returns the input string

very magic regex that will allow me to replace these images?

Regex detect if a matched comma(,) does not lie in a regex

What REGEX pattern should I use to look for a specific string pattern and remove anything else that doesnt match?

RegEx: Grabbing values between quotation marks

Categories

Resources