When to choose [^x]* or .*? - regex

Assume I have a substring in a longer string like (...)aaabaacaaaaaXaaaadaeaa(...) and I want to match or replace the aaabaacaaaaa part, with the X as delimiter.
I can use (.*?)X to find the string before the X, or I can use ([^X]*) to find it. I could also use a negative look-ahead, but I don't think it is necessary in this case.
So which of the two (or three) options is the better technique to get the group I want to match in this context?

Take this very simple example:
www\..*?\.com
www\.[^.]*\.com
The first one matches any input that contains a www. and a .com with anything in between. The second matches a www. and a .com that does not have a . in-between.
The first would match: www.google.something.com
The second would not.
Only use the negated class if that section absolutely cannot contain the character.
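A minimal sketch of the difference, using Python's re module (the choice of Python and the variable names are assumptions for illustration; the question itself is not tied to a language):

```python
import re

# Lazy dot: anything may appear between "www." and ".com", including dots.
lazy = re.compile(r"www\..*?\.com")
# Negated class: no dot is allowed between "www." and ".com".
negated = re.compile(r"www\.[^.]*\.com")

for host in ("www.google.com", "www.google.something.com"):
    # Print whether each pattern finds a match in the host name.
    print(host, bool(lazy.search(host)), bool(negated.search(host)))
```

Both patterns accept www.google.com; only the lazy one accepts www.google.something.com.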

.*? is called a lazy quantifier.
[^X]* is a greedy quantifier applied to a negated character class.
Wherever possible, use negation, i.e. [^X], since it doesn't cause backtracking. Of course, if the text you want to match can itself contain the letter X, then you have no choice but to use .*?
I am copying this text from a recent comment by @ridgerunner:
The expression: [^X)]* is certainly more efficient than .*? in
every language except possibly Perl (whose regex engine is highly
optimized for the lazy dot star expression). The expression .*? must
stop and backtrack once at every character position as it
"bumps-along", whereas the greedy quantifier applied to the negated
character class expression can consume the entire chunk in a single
step, with no backtracking.
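As a rough illustration in Python (an assumption; the comment above is about regex engines in general), both forms capture the same group for the string from the question. They differ once newlines are involved, because the dot skips newlines by default while a negated class does not:

```python
import re

text = "aaabaacaaaaaXaaaadaeaa"

# Both patterns capture everything before the first X.
lazy = re.match(r"(.*?)X", text)
negated = re.match(r"([^X]*)X", text)
print(lazy.group(1), negated.group(1))

# With a newline before the delimiter, '.' stops at the newline
# (unless re.DOTALL is set), whereas [^X] happily consumes it.
multiline = "aaa\nbbbXccc"
lazy_ml = re.match(r"(.*?)X", multiline)
negated_ml = re.match(r"([^X]*)X", multiline)
```

This sketch only shows that the results agree; it does not measure the backtracking cost described in the comment.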

Related

How do you search/replace the nth occurrence in vim visual mode?

This works:
'<,'>s/\v\/\zs(\/)//
'<,'>s/\v(\/)@<=\//BAR/
I was just wondering if there was an easier way to replace the nth occurrence with a {} or something in vim.
Replace the third forward slash '/'.
/dir1//fas//fooBar/¬
/dir2//\.foobar//fas/¬
/dir//.foo//fas/¬
How would I replace the fourth 'foo' ?
foo foo foo foo foo foo foo
foo foo foo foo foo foo foo
I'll discuss matching these patterns for simplicity; replacing them in a substitute command should work the same, using the same pattern in the :s command.
Replace the third forward slash '/'.
With a one-character match this is easier, since you can use [^/] to find characters that are not part of the match.
If you want to count matches, you need to start from the beginning of the line, so anchor with ^.
At that point, you can match two instances of "not slashes" followed by a "slash", and then on the third one you can use a \zs to mark it as the start of the actual match.
It's a bit unfortunate that / itself will need to be escaped with \/ if we use it on a match, but the resulting pattern is:
/\v^%([^\/]*\/){2}[^\/]*\zs\/
One common tip for patterns that include / is to search backwards using ? instead, so let's do that to improve readability:
?\v^%([^/]*/){2}[^/]*\zs/
The pattern items I used here that might be unfamiliar to some are:
%(...): Groups a pattern, same as (...) but doesn't create a capture group.
{2}: Matches the preceding pattern exactly twice.
Remember we're using "verymagic" with \v, so most of the above won't require backslashes.
There's a neat shortcut we can take to shorten the pattern above (and that will help us when we look at the case of the longer word), which is that if you have \zs in multiple places in your pattern, then the last one to match will be the one that will define the actual start of the match. (See :help /\zs.)
So we can simplify that to:
?\v^%([^/]*\zs/){3}
We match "not slashes" followed by a "slash" three times. The \zs will only take effect on the last (third) match, so you'll end up matching the third slash on the line.
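For readers outside Vim, the same counting idea can be sketched in Python. Python's re has no \zs, so this sketch captures the prefix in a group and puts it back in the replacement; the sample lines are taken from the question and BAR is just a placeholder replacement:

```python
import re

# Two "non-slashes then a slash" groups, then more non-slashes (all captured),
# followed by the third slash, which is the only part we actually replace.
third_slash = re.compile(r"^((?:[^/]*/){2}[^/]*)/")

lines = ["/dir1//fas//fooBar/", "/dir//.foo//fas/"]
# \1 restores the captured prefix; BAR stands in for the third slash.
replaced = [third_slash.sub(r"\1BAR", line, count=1) for line in lines]
print(replaced)
```

Capturing and re-inserting the prefix plays the role that \zs plays in the Vim pattern.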
Now let's move on to the more complicated case of matching a word:
How would I replace the fourth 'foo' ?
Here we can't use [^...] to match "not foo". I mean, we could use something like \v([^f]|f[^o]|fo[^o]) but that grows quickly as the word you're matching grows. And there's a better way to do it.
We can use a zero-width negative look-behind! See :help /\@<! for this interesting operator. In short, it takes the preceding atom (we'll use a group with the word here) and makes sure that that item does not match ending at that location.
So we can use this:
/\v^%(%(.%(foo)@<!)*\zsfoo){4}
The %(foo)@<! here ensures that each . we match will not be the last o in foo. That way we can accurately count the first, second, third and fourth foo on the line and make sure we won't match the fifth, sixth or seventh.
Here again we're using the trick of repeating it four times (to find the fourth match) and having the last \zs stick.
Note that the negative look-behind works well with a fixed word, but if you start having multis such as * or + then things get a lot more complicated. Take a look at the help for the operator and its warnings that it can be slow. There is also a variant of the operator that limits how many characters back it will look, which you don't strictly need when matching a fixed word, but which may be helpful for a more general match.
One interesting test case for this one is a match that has repetitions, such as fofo, and a text that includes repetitions of those, such as fofofo or fofofofo.
In fact, testing on those made me see that the pattern above will actually prefer to match the second occurrence in fofofo rather than the first one, if that's the fourth occurrence of fofo in that line. That's because the * operator is greedy. We can fix that by using {-} instead, which matches the shortest sequence possible.
Fixing that bug, we get:
/\v^%(%(.%(foo)@<!){-}\zsfoo){4}
Which is general enough and you can probably use with any fixed word, or even a pattern with a few variations (e.g. case, plurals, alternative spellings, etc.)
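A comparable sketch in Python, for illustration only. It swaps in a negative lookahead before each consumed character (instead of Vim's look-behind after it) and a captured prefix instead of \zs; replace_nth is a hypothetical helper name, not something from Vim or from Python's standard library:

```python
import re

def replace_nth(word, n, replacement, line):
    """Replace the n-th occurrence (1-based) of a fixed word on a line."""
    w = re.escape(word)
    # Lazily consume characters that do not start an occurrence of the word.
    gap = rf"(?:(?!{w}).)*?"
    # n-1 occurrences plus the gap before the n-th are captured as the prefix.
    pattern = rf"^((?:{gap}{w}){{{n - 1}}}{gap}){w}"
    # A function replacement avoids escape handling in the replacement text.
    return re.sub(pattern, lambda m: m.group(1) + replacement, line, count=1)

print(replace_nth("foo", 4, "BAR", "foo foo foo foo foo foo foo"))
```

The lazy gap mirrors the {-} fix: on a text such as fofofo, the first fofo is the one that gets counted.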

Negative lookbehind in regex

(Note: not a duplicate of Why can't you use repetition quantifiers in zero-width look behind assertions; see end of post.)
I'm trying to write a grep -P (Perl) regex that matches B, when it is not preceded by A -- regardless of whether there is intervening whitespace.
So, I tried this negative lookbehind, and tested it in regex101.com:
(?<!A)\s*B
This causes "AB" not to be matched, which is good, but "A B" does result in a match, which is not what I want.
I am not exactly sure why this is. It has something to do with the fact that \s* matches the empty string "", and you can say that there are, as such, infinity matches of \s* between A and B. But why does this affect "A B" but not "AB"?
Is the following regex a proper solution, and if so, why exactly does it fix the problem?
(?<![A\s])\s*B
I posted this before and it was incorrectly marked as a duplicate question. The variable-length thing I'm looking for is part of the match, not part of the negative lookbehind itself, so this is quite different from the other question. Yes, I could put the \s* inside the negative lookbehind, but I haven't done so (and doing so is not supported, as the other question explains). Also, I am particularly interested in why the alternate regex I post above works, since I know it works but I'm not exactly sure why. The other question did not help answer that.
But why does this affect "A B" but not "AB"?
Regexes match at a position, which it is helpful to think of as being between characters. In "A B" there is a position (after the space and before the B) where (?<!A) succeeds (because there isn't an A immediately preceding; there's a space instead), and \s*B succeeds (\s* matches the empty string, and B matches B), so the entire pattern succeeds.
In "AB" there is no such position. The only place where \s*B can match (immediately before the B), is also immediately after the A, so (?<!A) cannot succeed. There are no positions that satisfy both, so the pattern as a whole can't succeed.
Is the following regex a proper solution, and if so, why exactly does it fix the problem?
(?<![A\s])\s*B
This works because (?<![A\s]) will not succeed immediately after an A or after a space. So now the lookbehind forbids any match position that has spaces before it. If there are any spaces before the B, they have to be consumed by the \s* portion of the pattern, and the match position must be before them. If that position also doesn't have an A before it, the lookbehind can succeed and the pattern as a whole can match.
This is a trick that's made possible by the fact that \s is a fixed-width pattern that matches at every position inside of a non-empty \s* match. It can't be extended to the general case of any pattern between the (non-)A and the B.
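The behaviour described above can be checked with a short Python sketch (Python is an assumption here; the question uses grep -P, but both lookbehinds are fixed-width and Python's re accepts them as well):

```python
import re

naive = re.compile(r"(?<!A)\s*B")      # lookbehind only forbids an A
fixed = re.compile(r"(?<![A\s])\s*B")  # lookbehind forbids an A or whitespace

for s in ("AB", "A B", "C B", " B"):
    # Show the match object (or None) each pattern produces.
    print(repr(s), naive.search(s), fixed.search(s))
```

For "A B" the naive pattern matches just the B, starting after the space, while the fixed pattern finds no match at all.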

Efficient way to match regex between delimiters

I have a string and want to match the substring between the two first delimiters with a regular expression.
For example a string foo"text"bar anotherfoo"anothertext"anotherbar with delimiter " should yield text.
I found the following possible solutions:
Non-greedy matching "(.*?)"
Non-greedy matching with Lookahead and Lookbehind assertions (?<=")(.*?)(?=")
Negated character classes "([^"]*)"
Which one is the most efficient way of doing this? Or am I missing cases where these solutions behave differently (assuming the new line modifier is set so that a dot matches a new line)?
Since the delimiters are single characters, and the matched substring should not contain them, the negated character class solution ("([^"]*)") is the most efficient.
If you want to match only once, you do not even need the closing ": just use "([^"]*).
The lazy dot matching ("(.*?)") technique might cause performance issues when there is no ending delimiter and the text is rather large after the initial delimiter.
Lookarounds almost always involve the additional overhead of checking for some subpattern at each tested position. Since the delimiters here are single characters, the lookbehind/lookahead are not efficient. You only want to use this solution if there is no way to access capturing groups. In Python, capturing works well, so there is no need to use this solution.
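A small Python sketch of the three candidates on the example string from the question. It only shows that they return the same text; it makes no claim about timings:

```python
import re

s = 'foo"text"bar anotherfoo"anothertext"anotherbar'

# re.DOTALL mirrors the question's assumption that the dot matches newlines.
lazy = re.search(r'"(.*?)"', s, re.DOTALL).group(1)
lookaround = re.search(r'(?<=")(.*?)(?=")', s, re.DOTALL).group(1)
negated = re.search(r'"([^"]*)"', s).group(1)

# When only the first match is needed, the closing quote can be dropped.
open_ended = re.search(r'"([^"]*)', 'foo"text').group(1)

print(lazy, lookaround, negated, open_ended)
```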

Why is a character class faster than alternation?

It seems that using a character class is faster than the alternation in an example like:
[abc] vs (a|b|c)
I have heard about it being recommended and with a simple test using Time::HiRes I verified it (~10 times slower).
Also using (?:a|b|c) in case the capturing parenthesis makes a difference does not change the result.
But I can not understand why. I think it is because of backtracking but the way I see it at each position there are 3 character comparison so I am not sure how backtracking hits in affecting the alternation. Is it a result of the implementation's nature of alternation?
This is because the "OR" construct | backtracks between alternatives: if the first alternative does not match, the engine has to return to the position where the pointer was before it attempted that alternative, and then try the next one, whereas the character class can advance sequentially. See this match on a regex engine with optimizations disabled:
Pattern: (r|f)at
Match string: carat
Pattern: [rf]at
Match string: carat
But in short, the fact that the PCRE engine optimizes this away (single literal characters -> character class) is already a decent hint that alternations are inefficient.
Because a character class like [abc] is irreducible and can be optimised, whereas an alternation like (?:a|b|c) may also be (?:aa(?!xx)|[^xba]*?|t(?=.[^t])t).
The authors have chosen not to optimise the regex compiler to check that all elements of an alternation are a single character.
There is a big difference between "check that the next character is in this character class" and "check that the rest of the string matches any one of these regular expressions".
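To try this outside Perl, here is a hedged Python sketch. The helper name time_both and the sample text are assumptions, and the numbers depend entirely on the engine; some engines may rewrite single-character alternations into a character class themselves, in which case the two timings can come out nearly equal:

```python
import re
import timeit

char_class = re.compile(r"[rf]at")
alternation = re.compile(r"(?:r|f)at")

def time_both(text="carat" * 1000, repeat=200):
    """Return (seconds for the class, seconds for the alternation)."""
    t_class = timeit.timeit(lambda: char_class.findall(text), number=repeat)
    t_alt = timeit.timeit(lambda: alternation.findall(text), number=repeat)
    return t_class, t_alt

# Both patterns find exactly the same matches; only the cost may differ.
print(char_class.findall("carat"), alternation.findall("carat"))
```

Calling time_both() gives a rough comparison for one machine and one engine only.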

Regular Expression Opposite

Is it possible to write a regex that returns the converse of a desired result? Regexes are usually inclusive - finding matches. I want to be able to transform a regex into its opposite - asserting that there are no matches. Is this possible? If so, how?
http://zijab.blogspot.com/2008/09/finding-opposite-of-regular-expression.html states that you should bracket your regex with
/^((?!^ MYREGEX ).)*$/
but this doesn't seem to work. If I have the regex
/[a|b]./
the string "abc" returns false with both my regex and the converse suggested by zijab,
/^((?!^[a|b].).)*$/
Is it possible to write a regex's converse, or am I thinking incorrectly?
Couldn't you just check to see if there are no matches? I don't know what language you are using, but how about this pseudocode?
if (!'Some String'.match(someRegularExpression)) {
    // no match: do something...
}
If you can only change the regex, then the one you got from your link should work:
/^((?!REGULAR_EXPRESSION_HERE).)*$/
The reason your inverted regex isn't working is because of the '^' inside the negative lookahead:
/^((?!^[ab].).)*$/
      ^ # WRONG
Maybe it's different in vim, but in every regex flavor I'm familiar with, the caret matches the beginning of the string (or the beginning of a line in multiline mode). But I think that was just a typo in the blog entry.
You also need to take into account the semantics of the regex tool you're using. For example, in Perl, this is true:
"abc" =~ /[ab]./
But in Java, this isn't:
"abc".matches("[ab].")
That's because the regex passed to the matches() method is implicitly anchored at both ends (i.e., /^[ab].$/).
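Python exposes both behaviours, which makes the distinction easy to see. This is only an analogy to the Perl and Java cases above: re.search is unanchored like Perl's match operator, and re.fullmatch is anchored at both ends like Java's matches():

```python
import re

# Unanchored: succeeds if the pattern matches anywhere in the string.
found_anywhere = re.search(r"[ab].", "abc") is not None

# Anchored at both ends: the whole string must match the pattern.
matches_whole = re.fullmatch(r"[ab].", "abc") is not None

print(found_anywhere, matches_whole)  # True False
```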
Taking the more common, Perl semantics, /[ab]./ means the target string contains a sequence consisting of an 'a' or 'b' followed by at least one (non-line separator) character. In other words, at ANY point, the condition is TRUE. The inverse of that statement is, at EVERY point the condition is FALSE. That means, before you consume each character, you perform a negative lookahead to confirm that the character isn't the beginning of a matching sequence:
(?![ab].).
And you have to examine every character, so the regex has to be anchored at both ends:
/^(?:(?![ab].).)*$/
That's the general idea, but I don't think it's possible to invert every regex--not when the original regexes can include positive and negative lookarounds, reluctant and possessive quantifiers, and who-knows-what.
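A minimal Python sketch of that construction. The helper name invert is hypothetical, and the sketch assumes single-line input and a simple inner pattern without lookarounds or anchors of its own:

```python
import re

def invert(pattern):
    """Build a regex that matches only strings in which `pattern` never matches."""
    # Before consuming each character, check that `pattern` does not start here.
    return re.compile(rf"^(?:(?!(?:{pattern})).)*$")

original = re.compile(r"[ab].")
inverse = invert(r"[ab].")

for s in ("abc", "xyz", "xa", "", "cab"):
    # The inverse should match exactly when the original finds nothing.
    print(repr(s), bool(original.search(s)), bool(inverse.search(s)))
```

Note that "xa" is accepted by the inverse: the a is the last character, so [ab]. has nothing left to match.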
You can invert the character set by writing a ^ at the start ([^…]). So the opposite expression of [ab] (match either a or b) is [^ab] (match neither a nor b).
But the more complex your expression gets, the more complex is the complementary expression too. An example:
You want to match the literal foo. An expression that matches anything except a string containing foo would have to match either
any string that’s shorter than foo (^.{0,2}$), or
any three-character string that’s not foo (^([^f]..|f[^o].|fo[^o])$), or
any longer string that does not contain foo.
All together this may work:
^[^fo]*(f+($|[^o]|o($|[^fo]*)))*$
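But note: this applies only to foo. Hand-built complements like this are also easy to get wrong: this one rejects the string "o", for example, even though it does not contain foo.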
You can also do this (in Python) by using re.split and splitting on your regular expression, which returns all the parts that don't match the regex (see "how to find the converse of a regex").
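A short sketch of the re.split idea; the sample string is made up for illustration:

```python
import re

# Split on every match of [ab]. and keep only the pieces in between.
parts = re.split(r"[ab].", "xxabyyb!zz")
print(parts)  # ['xx', 'yy', 'zz']
```

If the pattern contained capturing groups, re.split would also return the captured text, so a non-capturing form is the safer choice for this use.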
In perl you can anti-match with $string !~ /regex/;.
With grep, you can use --invert-match or -v.
Java regexes have an interesting way of doing this, where you create a possessive optional match for the string you want to exclude, and then match data after it. If the possessive match fails, it's optional so it doesn't matter; if it succeeds, the second expression needs some extra data to match, and so the whole match fails.
It looks counter-intuitive, but works.
E.g. (foo)?+.+ matches bar, foox and xfoo but won't match foo (or an empty string).
It might be possible in other dialects, but couldn't get it to work myself (they seem more willing to backtrack if the second match fails?)