How can I write a regex which matches non greedy? [duplicate] - regex

I need help with non-greedy regular expression matching.
The match pattern is:
<img\s.*>
The text to match is:
<html>
<img src="test">
abc
<img
src="a" src='a' a=b>
</html>
I am testing on http://regexpal.com.
This expression matches all text from <img to the last >. I need it to stop at the first > encountered after the initial <img, so here I'd need to get two matches instead of the one that I get.
I tried all combinations of the non-greedy ?, with no success.

The non-greedy ? works perfectly fine. You just need to enable the "dot matches all" option in the regex engine you are testing with (regexpal, the tester you used, has this option too). By default, regex engines generally don't match line breaks with ., so you have to tell them explicitly that . should match line breaks as well.
For example,
<img\s.*?>
works fine!
Also, read about how dot behaves in various regex flavours.
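As a minimal sketch of the difference, here is the question's sample text run through Python's re module (Python is an assumption here; the question itself used regexpal). re.DOTALL plays the role of the "dot matches all" option:

```python
import re

# Sample text from the question; the second <img ...> tag spans two lines.
text = """<html>
<img src="test">
abc
<img
src="a" src='a' a=b>
</html>"""

# Greedy: with DOTALL, ".*" runs to the last ">" in the whole text.
greedy = re.findall(r"<img\s.*>", text, flags=re.DOTALL)

# Lazy: ".*?" stops at the first ">" after each "<img".
lazy = re.findall(r"<img\s.*?>", text, flags=re.DOTALL)

print(len(greedy), "greedy match")
print(len(lazy), "lazy matches:", lazy)
```

The greedy pattern yields a single match ending at </html>, while the lazy one yields one match per tag.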

Appending ? to a quantifier makes it non-greedy. For example, .* is greedy while .*? isn't. So you can use something like <img.*?> to match the whole tag, or <img[^>]*>.
But remember that HTML in general can't be parsed with regular expressions.

The other answers here presuppose that you have a regex engine which supports non-greedy matching, which is an extension introduced in Perl 5 and widely copied to other modern languages; but it is by no means ubiquitous.
Many older or more conservative languages and editors only support traditional regular expressions, which have no mechanism for controlling greediness of the repetition operator * - it always matches the longest possible string.
The trick then is to limit what it's allowed to match in the first place. Instead of .* you seem to be looking for
[^>]*
which still matches as many of something as possible; but the something is not just . "any character", but instead "any character which isn't >".
Depending on your application, you may or may not want to enable an option to permit "any character" to include newlines.
Even if your regular expression engine supports non-greedy matching, it's better to spell out what you actually mean. If this is what you mean, you should probably say this, instead of relying on non-greedy matching to (hopefully, probably) Do What I Mean.
For example, a regular expression with trailing context after the wildcard, like .*?><br/>, will jump over any nested > until it finds the trailing context (here, ><br/>), even if that requires straddling multiple > instances (and newlines, if you let it). In contrast, [^>]*><br/> (or even [^\n>]*><br/> if you have to explicitly disallow newlines) obviously can't and won't do that.
Of course, this is still not what you want if you need to cope with <img title="quoted string with > in it" src="other attributes"> and perhaps <img title="nested tags">, but at that point, you should finally give up on using regular expressions for this like we all told you in the first place.
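To illustrate the trailing-context point, here is a small sketch in Python (the sample string is made up for illustration):

```python
import re

# Hypothetical input: only the second <img> tag is directly followed by <br/>.
text = '<img src="a"> some text <img src="b"><br/>'

# Lazy wildcard: starts at the first <img and keeps expanding past the
# first ">" until it finally finds "><br/>".
lazy = re.search(r"<img.*?><br/>", text)

# Negated class: can never cross a ">", so only the second tag can match.
negated = re.search(r"<img[^>]*><br/>", text)

print(lazy.group())
print(negated.group())
```

The lazy version matches the entire string, whereas the negated character class matches only <img src="b"><br/>.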


Gvim regex: find matching XML tag pairs, non-greedy [duplicate]

So I have the following sample XML on a single line:
<foo>123</foo> <foo>456</foo> <bar>abc</bar> <foo>789</foo> <foo>0AB</foo> <bar>def</bar>
I'm looking for a regex which matches the first pair of <foo> tags and stops at the first <bar>.
I'm trying solutions along the lines of:
/<foo>.\+<\/foo>.\+<bar
But this matches the entire thing. How do I get it to stop at the first <bar>?
This happens because by default, regular expressions are greedy; that is, they match as much data as possible. However, in this case, what you want is a non-greedy regex so you match only the first part.
<foo>.\{-}<\/foo>.\{-}<bar
The pattern \{-} is equivalent to *, but is non-greedy, like Perl's *?. See :help non-greedy for more details.
As a side note, you cannot parse HTML or XML in the general case with regular expressions (since regexes are not powerful enough), but in this case I assume that you have a limited subset of data where this is good enough.
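For readers outside Vim, the same contrast can be sketched in Python, where *? plays the role of Vim's \{-} (this is an analogue, not Vim syntax):

```python
import re

line = ("<foo>123</foo> <foo>456</foo> <bar>abc</bar> "
        "<foo>789</foo> <foo>0AB</foo> <bar>def</bar>")

# Greedy, like the original attempt: runs on to the last "<bar".
greedy = re.search(r"<foo>.+</foo>.+<bar", line).group()

# Non-greedy, like the \{-} version: stops at the first "<bar".
lazy = re.search(r"<foo>.*?</foo>.*?<bar", line).group()

print(greedy)
print(lazy)
```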

Different regex evaluation in collections or patterns

I am experiencing a strange behaviour when searching for a regular expression in vim:
I attempt to clean up superfluous whitespace in a file and want to use the substitute command for it.
When I use the following regular expression with collections, vim matches single whitespaces as well:
\%[\s]\{2,}
When I use the same regular expression with patterns instead of collections vim correctly matches only 2 or more whitespaces:
\%(\s\)\{2,}
I know that I do not need to use a collection, but if I try the expression in an online regular expression tester (e.g. Rubular) it works with a collection as well.
Can anyone explain why these expressions are not evaluated in the same way?
Because \%[...] and \%(...\) are completely different patterns.
\%[...] means a sequence of optional atoms.
For example, r\%[ead] matches "read", "rea", "re" and "r".
While \%(...\) treats the enclosed atoms as a single atom.
For example, r\%(ead\) matches only "read".
So \%[\s]\{2,} can be read as \(\s\|\)\{2,}, which expands to \(\s\|\)\(\s\|\)\|\(\s\|\)\(\s\|\)\(\s\|\)\|....
The shortest alternative, \(\s\|\)\(\s\|\), can in turn match as \(\)\(\), \(\)\(\s\), \(\s\)\(\) or \(\s\)\(\s\), so it matches a single whitespace character too.
In contrast, \%(\s\)\{2,} can be read as \s\{2,}, which expands to \s\s\|\s\s\s\|..., so it matches only 2 or more whitespace characters.
Does this answer your question?
http://vimdoc.sourceforge.net/htmldoc/pattern.html#/\%[]
A sequence of optionally matched atoms. This always matches.
It matches as much of the list of atoms it contains as possible.
Thus it stops at the first atom that doesn't match.
For example:
/r\%[ead]
matches "r", "re", "rea" or "read". The longest that matches is used.
The problem is that \%[...] always matches (possibly matching nothing at all), which defeats the {2,} quantifier that follows it.
It is rarely used, but interesting nevertheless.

My regular expression matches too much. How can I tell it to match the smallest possible pattern? [duplicate]

I have this RegEx:
('.+')
It has to match character literals like in C. For example, if I have 'a' b 'a' it should match the a's and the ''s around them.
However, it also matches the b (it should not), probably because the b is, strictly speaking, also between ''s.
(I use this for syntax highlighting.)
I'm fairly new to regular expressions. How can I tell the regex not to match this?
It is being greedy: it matches the first apostrophe, the last one, and everything in between.
The following only allows characters that aren't apostrophes between the quotes:
('[^']+')
Another alternative is to try non-greedy matches.
('.+?')
Have you tried a non-greedy version, e.g. ('.+?')?
There are usually two modes of matching (or two sets of quantifiers): maximal (greedy) and minimal (non-greedy). The first results in the longest possible match, the second in the shortest. You can read about it (although in a Perl context) in the Perl Cookbook (Section 6.15).
Try:
('[^']+')
Inside square brackets, a leading ^ negates the class: [^'] matches any character except '. This way the pattern can't match all of 'a' b 'a', because there's a ' in between; instead it gives both instances of 'a'.
You need to escape the quotes:
\'[^\']+\'
Edit: Hmm, well, I suppose this answer depends on what language/system you're using.
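The three patterns suggested above can be compared side by side. This is a sketch in Python (the asker's highlighting tool is unknown, so the engine is an assumption); the capturing parentheses are dropped so that findall returns the whole match:

```python
import re

text = "'a' b 'a'"

# Greedy: from the first apostrophe to the last one.
greedy = re.findall(r"'.+'", text)

# Negated class: cannot run across an apostrophe.
negated = re.findall(r"'[^']+'", text)

# Non-greedy: stops at the first closing apostrophe.
lazy = re.findall(r"'.+?'", text)

print(greedy)
print(negated)
print(lazy)
```

The greedy pattern returns the whole string as one match, while the other two each return the two 'a' literals.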

Is the greedy option of regex really needed?

Let's say I have the following text, and I would like to extract the text inside the [Optionx] and [/Optionx] blocks:
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with the greedy option, the regex gives me:
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Does anybody need behaviour like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most applications we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
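As a rough sketch of the RLE idea in Python (rle_encode is a hypothetical helper name, and the count-then-character output format is an assumption), a greedy quantifier on the back-reference grabs each whole run:

```python
import re

def rle_encode(s: str) -> str:
    # (.)\1* matches one character plus all immediate repeats of it;
    # the greedy * is what makes each run come out in one piece.
    return re.sub(r"(.)\1*", lambda m: f"{len(m.group(0))}{m.group(1)}", s)

# Greedy vs. non-greedy on the example string from the answer.
greedy_run = re.search(r"(.)\1+", "abbbbbc").group()
lazy_run = re.search(r"(.)\1+?", "abbbbbc").group()

print(greedy_run, lazy_run)
print(rle_encode("abbbbbc"))
```

Note that . does not match newlines by default, so this sketch only handles single-line input.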
Regular expressions can potentially match multiple portions of a text.
For example, consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expression:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abccabab(c)cc
....
Some regular expression implementations are actually able to return the entire set of matches, but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match", which results in the greedy behaviour you observed.
This is typical of search and replace (a la grep), where with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is most useful when you have delimiters, as in your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
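Applied to the question's data, the two behaviours look like this. This is a sketch in Python; the patterns are assumptions written for this sample, since the question does not show the regex that was actually used:

```python
import re

text = """[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
"""

# Hypothetical patterns; DOTALL lets "." cross line breaks.
greedy = re.findall(r"\[Option\d+\]\n(.*)\n\[/Option\d+\]", text, flags=re.DOTALL)
lazy = re.findall(r"\[Option\d+\]\n(.*?)\n\[/Option\d+\]", text, flags=re.DOTALL)

print(greedy)  # one capture running from the first block into the last
print(lazy)    # one capture per block
```

The greedy capture reproduces the unwanted output from the question, while the lazy one returns the body of each block separately.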
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[/".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first part matches anything in square brackets; the second part then matches anything that isn't "[/". That way you don't have to care about greediness: you just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
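You can also just use this simple algorithm, which ports to any language; no regex needed. In Python, assuming the data is in a file named options.txt (a placeholder name):
inside = False  # True while we are between [OptionN] and [/OptionN]
with open("options.txt") as f:
    for line in f:
        if "[/Option" in line:
            inside = False
        if "[Option" in line:
            inside = True
            continue
        if inside:
            print(line.strip())
            # you can store the values of each option in this part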

Regular Expression Opposite

Is it possible to write a regex that returns the converse of a desired result? Regexes are usually inclusive - finding matches. I want to be able to transform a regex into its opposite - asserting that there are no matches. Is this possible? If so, how?
http://zijab.blogspot.com/2008/09/finding-opposite-of-regular-expression.html states that you should bracket your regex with /^((?!^ MYREGEX ).)*$/, but this doesn't seem to work. If I have the regex /[a|b]./, the string "abc" returns false with both my regex and the converse suggested by zijab, /^((?!^[a|b].).)*$/. Is it possible to write a regex's converse, or am I thinking incorrectly?
Couldn't you just check to see if there are no matches? I don't know what language you are using, but how about this pseudocode?
if (!'Some String'.match(someRegularExpression))
// do something...
If you can only change the regex, then the one you got from your link should work:
/^((?!REGULAR_EXPRESSION_HERE).)*$/
The reason your inverted regex isn't working is because of the '^' inside the negative lookahead:
/^((?!^[ab].).)*$/
      ^ # WRONG
Maybe it's different in vim, but in every regex flavor I'm familiar with, the caret matches the beginning of the string (or the beginning of a line in multiline mode). But I think that was just a typo in the blog entry.
You also need to take into account the semantics of the regex tool you're using. For example, in Perl, this is true:
"abc" =~ /[ab]./
But in Java, this isn't:
"abc".matches("[ab].")
That's because the regex passed to the matches() method is implicitly anchored at both ends (i.e., /^[ab].$/).
Taking the more common, Perl semantics, /[ab]./ means the target string contains a sequence consisting of an 'a' or 'b' followed by at least one (non-line separator) character. In other words, at ANY point, the condition is TRUE. The inverse of that statement is, at EVERY point the condition is FALSE. That means, before you consume each character, you perform a negative lookahead to confirm that the character isn't the beginning of a matching sequence:
(?![ab].).
And you have to examine every character, so the regex has to be anchored at both ends:
/^(?:(?![ab].).)*$/
That's the general idea, but I don't think it's possible to invert every regex--not when the original regexes can include positive and negative lookarounds, reluctant and possessive quantifiers, and who-knows-what.
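A quick way to check the lookahead-based inverse is to run both patterns over a few strings. This is a sketch in Python with made-up sample strings (none contain newlines, which . would not match by default):

```python
import re

original = re.compile(r"[ab].")
inverse = re.compile(r"^(?:(?![ab].).)*$")

samples = ["abc", "xyz", "a", "xbz", ""]

# For each sample: (does the original match?, does the inverse match?)
results = {s: (bool(original.search(s)), bool(inverse.search(s))) for s in samples}

for s, (orig_hit, inv_hit) in results.items():
    print(repr(s), orig_hit, inv_hit)
```

For every sample, exactly one of the two patterns matches.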
You can invert the character set by writing a ^ at the start ([^…]). So the opposite expression of [ab] (match either a or b) is [^ab] (match neither a nor b).
But the more complex your expression gets, the more complex the complementary expression gets too. An example:
You want to match the literal foo. An expression that matches anything except a string containing foo would have to match either
any string that’s shorter than foo (^.{0,2}$), or
any three-character string that’s not foo (^([^f]..|f[^o].|fo[^o])$), or
any longer string that does not contain foo.
All together, something like this gets close:
^[^fo]*(f+($|[^o]|o($|[^fo]*)))*$
But note two things: this applies only to foo, and it is still only a sketch, since it also rejects some strings that don't contain foo (for example, a lone o).
You can also do this (in Python) by using re.split and splitting on your regular expression, which returns all the parts that don't match the regex.
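A minimal sketch of the re.split approach, with a made-up input string and pattern:

```python
import re

text = "abc123def45gh"

# Split on every run of digits; what remains are the non-matching parts.
parts = re.split(r"\d+", text)

print(parts)
```

Keep in mind that re.split returns empty strings when a match sits at the very start or end of the input, and that capturing groups in the pattern cause the matched text to be included in the result.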
In Perl you can anti-match with $string !~ /regex/;.
With grep, you can use --invert-match or -v.
Java regexes have an interesting way of doing this: you can create a possessive optional match (?+) for the string you want to exclude, and then match data after it. If the optional match fails, it doesn't matter, because it's optional; if it succeeds, the second expression still needs some extra data to match, and because a possessive quantifier doesn't give back what it consumed, the overall match fails.
It looks counter-intuitive, but works.
E.g. (foo)?+.+ matches bar, foox and xfoo but won't match foo (or an empty string).
It might be possible in other dialects, but I couldn't get it to work myself (they seem more willing to backtrack if the second match fails).