How do you search/replace the nth occurrence in vim visual mode? - regex

This works:
'<,'>s/\v\/\zs(\/)//
'<,'>s/\v(\/)@<=\//BAR/
I was just wondering if there was an easier way to replace the nth occurrence with a {} or something in vim.
Replace the third forward slash '/'.
/dir1//fas//fooBar/¬
/dir2//\.foobar//fas/¬
/dir//.foo//fas/¬
How would I replace the fourth 'foo' ?
foo foo foo foo foo foo foo
foo foo foo foo foo foo foo

For simplicity I'll discuss matching these patterns; replacing them with a substitute command works the same way, using the same pattern in the :s command.
Replace the third forward slash '/'.
With a one-character match this is easier, since you can use [^/] to find characters that are not part of the match.
If you want to count matches, you need to start from the beginning of the line, so anchor with ^.
At that point, you can match two instances of "not slashes" followed by a "slash", and then on the third one you can use a \zs to mark it as the start of the actual match.
It's a bit unfortunate that / itself will need to be escaped with \/ if we use it on a match, but the resulting pattern is:
/\v^%([^\/]*\/){2}[^\/]*\zs\/
One common tip for patterns that include / is to search backwards using ? instead, so let's do that to improve readability:
?\v^%([^/]*/){2}[^/]*\zs/
The pattern items I used here that might be unfamiliar to some are:
%(...): Groups a pattern, same as (...) but doesn't create a capture group.
{2}: Matches the preceding pattern exactly twice.
Remember we're using "very magic" mode with \v, so most of the above won't require backslashes.
There's a neat shortcut we can take to shorten the pattern above (and that will help us when we look at the case of the longer word), which is that if you have \zs in multiple places in your pattern, then the last one to match will be the one that will define the actual start of the match. (See :help /\zs.)
So we can simplify that to:
?\v^%([^/]*\zs/){3}
We match "not slashes" followed by a "slash" three times. The \zs will only take effect on the last (third) match, so you'll end up matching the third slash on the line.
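For comparison outside vim, the same counting idea can be sketched in Python. This is only an illustration under assumptions: Python's re module has no \zs, so the sketch captures everything before the third slash and puts it back in the replacement. The helper name replace_third_slash and the BAR replacement text are made up for the example.

```python
import re

# Two "not slashes + slash" groups, then more non-slashes, then the third slash.
# Python's re has no \zs, so the prefix is captured and re-inserted instead.
THIRD_SLASH = re.compile(r'^((?:[^/]*/){2}[^/]*)/')

def replace_third_slash(line, repl):
    """Hypothetical helper: replace only the third '/' on the line."""
    return THIRD_SLASH.sub(lambda m: m.group(1) + repl, line, count=1)

if __name__ == '__main__':
    for line in ['/dir1//fas//fooBar/', '/dir//.foo//fas/']:
        print(replace_third_slash(line, 'BAR'))
```

A line with fewer than three slashes is returned unchanged, because the anchored pattern cannot match.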
Now let's move on to the more complicated case of matching a word:
How would I replace the fourth 'foo' ?
Here we can't use [^...] to match "not foo". I mean, we could use something like \v([^f]|f[^o]|fo[^o]) but that grows quickly as the word you're matching grows. And there's a better way to do it.
We can use a zero-width negative look-behind! See :help /\@<! for this interesting operator. In short, it takes the preceding atom (we'll use a group with the word here) and makes sure that that item does not match ending at that location.
So we can use this:
/\v^%(%(.%(foo)@<!)*\zsfoo){4}
The %(foo)@<! here ensures that each . we match will not be the last o in foo. That way we can accurately count the first, second, third and fourth foo on the line and make sure we won't match the fifth, sixth or seventh.
Here again we're using the trick of repeating it four times (to find the fourth match) and having the last \zs stick.
Note that the negative look-behind works well with a fixed word, but if you start having multis such as * or + etc. then things get a lot more complicated. Take a look at the help for the operator and the warnings that it can be slow. There is also a variant of the operator that limits how many characters back it will look, which you don't strictly need when matching a fixed word, but may be helpful on a more general match.
One interesting test case for this one is a match that has repetitions, such as fofo, and a text that includes repetitions of those, such as fofofo or fofofofo.
In fact, testing on those made me see that the pattern above will actually prefer to match the second occurrence in fofofo rather than the first one, if that's the fourth occurrence of fofo in that line. That's because the * operator is greedy. We can fix that by using {-} instead, which matches the shortest sequence possible.
Fixing that bug, we get:
/\v^%(%(.%(foo)@<!){-}\zsfoo){4}
This is general enough that you can probably use it with any fixed word, or even with a pattern that has a few variations (e.g. case, plurals, alternative spellings, etc.)
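Outside vim, a scripting language can count the occurrences directly instead of encoding the count in the pattern. The following is a minimal Python sketch, not a translation of the vim pattern: replace_nth is a hypothetical helper, and it counts non-overlapping matches from left to right, which is an assumption that may differ from how you want to treat overlapping cases such as fofofo.

```python
import re
from itertools import islice

def replace_nth(line, word, n, repl):
    """Hypothetical helper: replace the n-th (1-based) non-overlapping
    occurrence of a fixed word, scanning left to right."""
    matches = re.finditer(re.escape(word), line)
    nth = next(islice(matches, n - 1, None), None)
    if nth is None:
        return line  # fewer than n occurrences: leave the line alone
    return line[:nth.start()] + repl + line[nth.end():]

if __name__ == '__main__':
    print(replace_nth('foo foo foo foo foo foo foo', 'foo', 4, 'BAR'))
```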

Related

Why can "a*a+" and "(a{2,3})*a{2,3}" match "aaaa" while "(a{2,3})*" cannot?

My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary. Therefore, in a*a+, a* would give one (or maybe more?) character back to a+ so it can match.
However, in (a{2,3})*, why doesn't the first "instance" of a{2,3} give a character to the second "instance" so the second one can match?
Also, in (a{2,3})*a{2,3} the first part does seem to give a character to the second part.
A simple workaround for your question is to match aaaa with regex ^(a{2,3})*$.
Your problem is that:
In the case of (a{2,3})*, regex doesn't seem to consume as much
character as possible.
I suggest not to think in giving back characters. Instead, the key is acceptance.
Once the regex accepts your string, matching is over. The pattern a{2,3} only matches aa or aaa. So when matching aaaa with (a{2,3})*, the greedy engine matches aaa. It then can't match another a{2,3}, because only one a remains. Although the regex engine could backtrack and fit in an extra a{2,3}, it doesn't: aaa is already accepted by the regex, so the engine does not do expensive backtracking.
If you add a $ to the end of the regex, it simply tells the regex engine that a partial match is unacceptable. Moreover, the (a{2,3})*a{2,3} case is easy to explain in terms of accepting and backtracking.
The main problem is this:
My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary
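This is not quite what greedy means.
Greedy simply means "try the longest possible match first". The engine gives characters back only when the rest of the pattern would otherwise fail to match, never to make an already successful match longer.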
Once you interpret the expressions with this new understanding everything makes sense.
a*a+ - zero or more a followed by one or more a
(a{2,3})*a{2,3} - zero or more of either two or three a followed by either two or three a (note: the KEY THING to remember is "zero or more", the first part not matching any character is considered a match)
(a{2,3})* - zero or more of either two or three a (this means that after matching three as the last single a left cannot match)
Backtracking is done only if the match fails; however, aaa is a valid match. A negative lookahead (?!a) can be used to prevent the match from being followed by an a.
compare
(aaa?)*
and
(aaa?)*(?!a)
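The behaviour described in these answers can be checked with a short script. This is a sketch that assumes Python's re module, which is a backtracking engine; the question does not say which regex flavor is in use, so other engines may differ. The helper name matched is made up for the example.

```python
import re

TEXT = 'aaaa'

def matched(pattern, text=TEXT):
    """Return the text matched at the start of the string, or None.
    re.match anchors at the start but does not require reaching the end."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

PATTERNS = [
    r'a*a+',             # a* must give one 'a' back so that a+ can match
    r'(a{2,3})*a{2,3}',  # the group backs off so the tail can match
    r'(a{2,3})*',        # already succeeds after 'aaa', so nothing forces a retry
    r'^(a{2,3})*$',      # '$' rejects the partial match and forces backtracking
    r'(aaa?)*',
    r'(aaa?)*(?!a)',     # the lookahead rejects a match that is followed by 'a'
]

if __name__ == '__main__':
    for pattern in PATTERNS:
        print(pattern, '->', matched(pattern))
```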

Regex (.*) without matching the second case

Given the following sample input text:
{{A1|def|ghi|jkl}}hello world. {{A2|mno}}bye world.
How can I create a regex pattern that matches only the first instance of {{ ... }} (i.e. only {{A1|def|ghi|jkl}})? A1 and A2 are fixed inputs, and def, ghi, jkl, and mno could be anything.
I've tried this:
\{\{A1\|(.*)\|(.*)\|(.*)\}\}
But that returns everything ({{A1|def|ghi|jkl}}hello world. {{A2|mno}}).
Note that def or ghi or jkl or mno could be numbers, English letters or other languages (e.g. Chinese/Japanese/Korean).
It's a little unclear what you are trying to accomplish. At first, I thought that your problem was just that you were getting the entire thing when all you really wanted was the A1 or A2 part. If so, here's the answer:
Since you didn't specify which flavor of regex you are using, it's hard to say for sure. If you are using a version which supports look-arounds, you could do something like this:
(?<={{)\w+(?=(\|[^|}]*)+}})
Here's the meaning of the pattern:
(?<={{) - This is a positive look-behind expression which means that it asserts that any match must be preceded by certain characters. In this case, the characters are {{.
\w+ - This is the actual part that we are matching. In this case, it's one or more word characters. \w is a special character class. This varies, though, depending on which regex engine you are using. Something like [A-Z][0-9] may be more appropriate, depending on your needs.
(?=(\|[^|}]*)+}}) - This is a positive look-ahead expression. That means that it asserts that any match must be followed by some particular pattern of characters. In this case, it's looking for matches to be followed by (\|[^|}]*)+}}.
However, if look-arounds are not possible, then you can match it with a capturing group, like this:
{{(\w+)(\|[^|}]*)+}}
If you do it that way, you'll need to read the value of the first group for each match.
As far as only finding the first match goes, that really depends on which tool or language you are using. Most regex engines only find the first match by default and only find additional matches when a global modifier is specified (often /g at the end).
However, now, after having edited your question, and trying better to understand what you meant, I think that your real problem is greediness. The repetitions, such as *, in regex are greedy by default. That means they will capture as much text as they possibly can and still have it match. In this case, you don't want it to find the longest possible match. In this case, you want it to find the shortest possible match. You could do that simply by making the repetitions lazy (i.e. non-greedy). To do that, simply add a ? after the *. For instance:
\{\{A1\|(.*?)\|(.*?)\|(.*?)\}\}
However, that's not very efficient. If this pattern is going to be used often or on large inputs it would be better to use a more restrictive character class, such as [^}|] instead of ., so that the lazy modifier is unnecessary. For example:
\{\{A1\|([^}|]*)\|([^}|]*)\|([^}|]*)\}\}
Or, more simply:
{{A1(\|([^}|]*)){3}}}
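To see the three variants side by side, here is a small sketch that assumes Python's re module (the question does not name a regex flavor). It runs the greedy, lazy and negated-class patterns against the sample input from the question.

```python
import re

text = '{{A1|def|ghi|jkl}}hello world. {{A2|mno}}bye world.'

# Greedy: each .* grabs as much as it can while the whole pattern still matches.
greedy = re.search(r'\{\{A1\|(.*)\|(.*)\|(.*)\}\}', text)
# Lazy: each .*? stops at the first place the rest of the pattern can match.
lazy = re.search(r'\{\{A1\|(.*?)\|(.*?)\|(.*?)\}\}', text)
# Negated class: a field can never run past a '|' or '}'.
negated = re.search(r'\{\{A1\|([^}|]*)\|([^}|]*)\|([^}|]*)\}\}', text)

if __name__ == '__main__':
    for name, m in [('greedy', greedy), ('lazy', lazy), ('negated', negated)]:
        print(name, m.group(0), m.groups())
```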
The problem with your pattern is simply that you've made all of the * quantifiers greedy. They're matching as much of the string as they can (while still allowing the whole pattern to match). Just make them non-greedy *?:
\{\{A1\|(.*?)\|(.*?)\|(.*?)\}\}
https://regex101.com/r/pK4gE7/1

When to choose [^x]* or .*?

Assume I have a substring in a longer string like (...)aaabaacaaaaaXaaaadaeaa(...) and I want to match or replace the aaabaacaaaaa with the X as delimiter.
I can now use (.*?)X to find the string before the X, or I can use ([^X]*) to find it. I could also use a negative look-ahead, but I don't think it is necessary in this case.
So which one of the two (or three) options is the better technique to get the group i want to match in this context?
Take this very simple example:
www\..*?\.com
www\.[^.]*\.com
The first one matches any input that contains a www. and a .com with anything in between. The second matches a www. and a .com that does not have a . in-between.
The first would match: www.google.something.com
The second would not.
Only use the negated class if that section absolutely cannot contain the character.
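The difference between the two patterns can be checked with a quick sketch, assuming Python's re module; the sample host names are only illustrative.

```python
import re

lazy_dot = re.compile(r'www\..*?\.com')   # anything may sit between www. and .com
negated = re.compile(r'www\.[^.]*\.com')  # no further dot allowed in between

samples = ['www.google.com', 'www.google.something.com']

if __name__ == '__main__':
    for s in samples:
        print(s, bool(lazy_dot.search(s)), bool(negated.search(s)))
```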
.*? is a lazy quantifier.
[^X]* is a greedy quantifier applied to a negated character class.
Wherever possible, use negation, i.e. [^X], since it doesn't cause backtracking. Of course, if the text you want to match can itself contain the letter X, then you have no choice but to use .*?
I am copying this text from a recent comment by @ridgerunner:
The expression: [^X)]* is certainly more efficient than .*? in
every language except possibly Perl (whose regex engine is highly
optimized for the lazy dot star expression). The expression .*? must
stop and backtrack once at every character position as it
"bumps-along", whereas the greedy quantifier applied to the negated
character class expression can consume the entire chunk in a single
step, with no backtracking.

RegEx - Exclude Matched Patterns

I have the below patterns to be excluded.
make it cheaper
make it cheapere
makeitcheaper.com.au
makeitcheaper
making it cheaper
www.make it cheaper
ww.make it cheaper.com
I've created a regex to match any of these. However, I want to get everything else other than these. I am not sure how to inverse this regex I've created.
mak(e|ing) ?it ?cheaper
Above pattern matches all the strings listed. Now I want it to match everything else. How do I do it?
From searching, it seems I need something like a negative look-ahead or look-behind. But I don't really get it. Can someone point me in the right direction?
You can just put it in a negative look-ahead like so:
(?!mak(e|ing) ?it ?cheaper)
On its own that isn't going to work, though: if you do a matches[1], it won't match, since a look-ahead only checks and doesn't consume anything; and if you do a find[1], it will match many times, since there are lots of positions in the string where the following characters don't match the pattern above.
To fix this, depending on what you wish to do, we have 2 choices:
If you want to exclude all strings that are exactly one of those (i.e. "make it cheaperblahblah" is not excluded), check for start (^) and end ($) of string:
^(?!mak(e|ing) ?it ?cheaper$).*
The .* (zero or more wild-cards) is the actual matching taking place. The negative look-ahead checks from the first character.
If you want to exclude all strings containing one of those, you can make sure the look-ahead isn't matched before every character we match:
^((?!mak(e|ing) ?it ?cheaper).)*$
An alternative is to add wild-cards to the beginning of your look-ahead (i.e. exclude all strings that, from the start of the string, contain anything, then your pattern), but I don't currently see any advantage to this (arbitrary length look-ahead is also less likely to be supported by any given tool):
^(?!.*mak(e|ing) ?it ?cheaper).*
Because of the ^ and $, either doing a find or a matches will work for either of the above (though, in the case of matches, the ^ is optional and, in the case of find, the .* outside the look-ahead is optional).
1: Although they may not be called that, many languages have functions equivalent to matches and find with regex.
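The two look-ahead variants above can be tried out with a short sketch. It assumes Python's re module, where re.search plays the role of find; the helper names kept_exact and kept_contains are made up for the example.

```python
import re

# Rejects only strings that are exactly one of the unwanted phrases.
EXACT = re.compile(r'^(?!mak(e|ing) ?it ?cheaper$).*')
# Rejects any string that contains one of the unwanted phrases anywhere.
CONTAINS = re.compile(r'^((?!mak(e|ing) ?it ?cheaper).)*$')

def kept_exact(s):
    """True if s is not exactly an excluded phrase (hypothetical helper)."""
    return EXACT.search(s) is not None

def kept_contains(s):
    """True if s does not contain an excluded phrase (hypothetical helper)."""
    return CONTAINS.search(s) is not None

if __name__ == '__main__':
    for s in ['make it cheaper', 'make it cheaperblahblah', 'something else']:
        print(repr(s), kept_exact(s), kept_contains(s))
```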
The above is the strictly-regex answer to this question.
A better approach might be to stick to the original regex (mak(e|ing) ?it ?cheaper) and see if you can negate the matches directly with the tool or language you're using.
In Java, for example, this would involve doing if (!string.matches(originalRegex)) (note the !, which negates the returned boolean) instead of if (string.matches(negLookRegex)).
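The same idea in Python, as a minimal sketch: keep the original regex and negate the result in code. The function names are hypothetical. Note that Java's String.matches requires the whole string to match, which corresponds to re.fullmatch here, while re.search finds the pattern anywhere in the string.

```python
import re

ORIGINAL = re.compile(r'mak(e|ing) ?it ?cheaper')

def does_not_contain(s):
    """True if the excluded phrase appears nowhere in s."""
    return ORIGINAL.search(s) is None

def is_not_exactly(s):
    """True unless the whole string is the excluded phrase
    (the counterpart of negating Java's String.matches)."""
    return ORIGINAL.fullmatch(s) is None

if __name__ == '__main__':
    for s in ['make it cheaper', 'ww.make it cheaper.com', 'something else']:
        print(repr(s), does_not_contain(s), is_not_exactly(s))
```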
The negative lookahead, I believe is what you're looking for. Maybe try:
(?!.*mak(e|ing) ?it ?cheaper)
And maybe a bit more flexible:
(?!.*mak(e|ing) *it *cheaper)
Just in case there is more than one space.

What does ?: do in regex

I have a regex that looks like this
/^(?:\w+\s)*(\w+)$*/
What is the ?:?
It indicates that the subpattern is a non-capture subpattern. That means whatever is matched in (?:\w+\s), even though it's enclosed by () it won't appear in the list of matches, only (\w+) will.
You're still looking for a specific pattern (in this case, a single whitespace character following at least one word), but you don't care what's actually matched.
It means: group only, but do not remember the grouped part.
By default, ( ) tells the regex engine to remember the part of the string that matches the pattern between the parentheses. But at times we just want to group a pattern without triggering the regex memory; to do that we use (?: in place of (.
Further to the excellent answers provided, its usefulness is also to simplify the code required to extract groups from the matched results. For example, your (\w+) group is known as group 1 without having to be concerned about any groups that appear before it. This may improve the maintainability of your code.
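The effect on group numbering can be seen in a short sketch. It assumes Python's re module and uses a slightly simplified form of the pattern from the question (without the trailing * after $); the sample string is made up.

```python
import re

s = 'hello big world'

# Non-capturing prefix: only the last word is remembered, as group 1.
non_capturing = re.match(r'^(?:\w+\s)*(\w+)$', s)
# Capturing prefix: the prefix takes group 1 and pushes the last word to group 2.
capturing = re.match(r'^(\w+\s)*(\w+)$', s)

if __name__ == '__main__':
    print(non_capturing.groups())
    print(capturing.groups())
```

With the capturing version, group 1 holds only the last repetition of the prefix group, and the word you actually want moves to group 2.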
Let's understand this with an example.
Say we are given the string s = "a eeee".
Your regex, in the form /^(?:\w+\s)(\w+)$/, starts at the beginning of the string, finds 'a', and then the whitespace character after it. If you had not included ?:, that first group would have been captured and returned as 'a ' (a followed by the whitespace character).
If you don't want that in your results, you write the group as (?:\w+\s): it still has to match 'a ' in the string, but what it matched is not remembered, so the only captured group is the (\w+) that follows it ('eeee' here).
PS: I am a beginner in regex. This is what I have understood about ?:. Feel free to point out anything I explained wrongly.