Vim multiline regex gives overlapping matches - regex

I was surprised when I noticed that my greedy multiline regex was giving overlapping matches in Vim. The regex is designed to match an entire block of text, or consecutive non-blank lines.
The regex apparently matched everything I expected it to (highlight looked correct), but when using n to skip to the next match instead of skipping to the next block, it went to the next line in the current block.
Here is the regex I was using (equivalent to (.+\n){1,} for most regex engines):
\(.\+\n\)\{1,}
This should match at least one non-empty line, and as many consecutive non-empty lines as possible. Here is an example text file:
block 1
some stuff
more stuff

block 2
foo bar
baz qux
After applying this regex (/\(.\+\n\)\{1,} followed by Enter), the two blocks are highlighted correctly, and I expect only two matches of the regex, one for each block. However, when I press n to advance to the next match, every non-empty line appears to match: the cursor starts on the first line, n takes it to the second line, then the third, then to the start of block 2, and so on.
How can I change my regex so that I see the expected behavior of each block being a single match so that n advances to the next block, instead of the next line?
I am also interested in knowing if this behavior is in the documentation somewhere, or if there is an option to change this behavior. Note that when using the same regex in a search/replace the behavior is what I expect (replacement would only be applied twice, once for each block).

The following regex seems to work:
\(\%^\|^\n\)\zs\(.\+\n\)\+
Explanation:
\( # start of group
\%^ # beginning of file
\| # OR
^\n # a blank line
\) # end of group
\zs # start matching here
\(.\+\n\)\+ # at least one non-blank line
By using the very magic option the length can be reduced a bit:
\v(%^|^\n)\zs(.+\n)+
Looking forward to seeing if anyone can come up with a shorter solution!
zigdon's answer helped me to understand better why the behavior is the way it is. When n is used to jump to the next match it searches for the first match of the regex from the cursor's current position, even if the next matching position was included in the previous match. This is why anchoring the regex to the start of the block appears to be necessary.
Thanks to Nolen Royalty for helping me get rid of an unnecessary lookahead in the first group.
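The difference between the unanchored and the anchored pattern can be illustrated outside Vim. The following is a rough Python sketch, not Vim's actual search algorithm: it assumes a simplified model in which a match may begin at the start of any line, and it translates the patterns into Python's re syntax (Python has no \zs, so a lookbehind for a blank line stands in for it here). The sample text assumes the two blocks are separated by one blank line.

```python
import re

# Sample text from the question, assuming one blank line between the blocks.
text = "block 1\nsome stuff\nmore stuff\n\nblock 2\nfoo bar\nbaz qux\n"

# Unanchored block pattern: the lookahead reports every line start where a
# run of non-empty lines could begin, including overlapping candidates.
unanchored = re.compile(r"(?m)^(?=(?:.+\n)+)")

# Anchored variant: a block may only begin at the start of the text or
# directly after a blank line (two consecutive newlines).
anchored = re.compile(r"(?:\A|(?<=\n\n))(?=(?:.+\n)+)")


def start_lines(pattern):
    """Return the 1-based line numbers at which the pattern can start."""
    return [text.count("\n", 0, m.start()) + 1 for m in pattern.finditer(text)]


overlapping_lines = start_lines(unanchored)
block_lines = start_lines(anchored)
print(overlapping_lines)
print(block_lines)
```

In this model the unanchored pattern can start on every non-empty line, while the anchored one can only start on the first line of each block, which mirrors what n does in each case.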

Since your match says "match one or more non-empty lines" it can certainly match multiple times within the same paragraph. To fix this, you can specify that the cursor should be placed at the end of the match - this means the next match will start from the end of the paragraph. You can do this with the \zs zero-width character, available in vim:
\zs Matches at any position, and sets the start of the match there: The
next char is the first char of the whole match. |/zero-width|
So your match will become:
\(.\+\n\)\{1,}\zs

Related

Using a regex to append to the end of non-blank lines

I would think that this would be a common question, but I can't find anybody asking how to do this. There are people asking how to do the opposite (find blank lines) and add a <br><br> at the end of each one. For human readability, this document has blank lines between paragraphs.
(I don't want to replace the blank lines with <br><br>. I know this would achieve the same result, but for human readability and personal preference, I don't like how this makes the document one giant block of text.)
How can I write a regex that captures -- I don't know if this is the right word to use; maybe "groups"? -- the end of lines that aren't blank so that I can append to the end of them?
I am using Visual Studio Code, so I'd like this to work in the search/replace box:
I'm assuming in the replacement box above, I'd need to say $some group number(s?), so I just said $x as a temporary placeholder. Here's what I've tried as search patterns:
^(?!:($))$
^(?!:(\S$))$
^(?!:([^\S]$))$
^(?!:([^\s]$))$
^(?!([^\S]+))$
All of these seem to grab the inverse of what I'm trying to find. I guess my strategy has been, between the beginning and end of the line, there shouldn't be only whitespace. But I'm pretty sure that's not what I'm saying.
You can use
Find What:      (\S)[^\S\n]*(\n)
Replace With: $1<br><br>$2
NOTE: The above replacement will not add the <br>s at the end of the last line if it is not blank. If you need that, use
Find What:      (\S)[^\S\n]*$
Replace With: $1<br><br>
The regex above matches the last non-whitespace char on a line (capturing it in Group 1 to keep it), then matches any horizontal whitespace, and then captures the line break in Group 2 so that it is kept in the output.
Details
(\S) - Group 1: any non-whitespace char
[^\S\n]* - zero or more horizontal whitespace chars
(\n) - Group 2: line break.
$ - end of a line (note that m flag (in its PCRE meaning) is always on, by default, in VSCode regex).
The replacement is $1<br><br>$2, Group 1 value + <br><br> + Group 2 value (if you use the first regex).
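For readers who want to try the second find/replace pair outside VS Code, here is a minimal Python sketch. It assumes Python's re module as a stand-in for the editor's regex engine, so the multiline flag has to be set explicitly and the replacement uses \1 instead of $1. The sample text is made up for illustration.

```python
import re

# Hypothetical sample: the first line has trailing spaces, followed by a
# blank line and a second paragraph.
text = "First paragraph.  \n\nSecond paragraph.\n"

# (\S) keeps the last non-whitespace character; [^\S\n]* consumes trailing
# horizontal whitespace; $ with re.MULTILINE matches at the end of each line.
pattern = re.compile(r"(\S)[^\S\n]*$", re.MULTILINE)

result = pattern.sub(r"\1<br><br>", text)
print(result)
```

Blank lines are left alone because they contain no non-whitespace character for (\S) to match. Note that the trailing spaces on the first line are dropped by this pattern.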
is changed into
This works to retain the spaces at the end of lines:
Find: (?<=^.*)(\S+.*)
Replace: $1<br><br>

What regular expression will select all lines that have more than one punctuation mark?

I have this regular expression:
\..*?\.
But it only selects between two periods, not every punctuation mark, and it also selects across multiple lines.
Would modifying this expression to only take in one line at a time work somehow, if there's also a way to group punctuation into where we have a period?
Just to make things simpler, at this time I only need the expression to recognize periods, exclamation points, and question marks. I don't need it to register commas.
Thanks to Nathan and Agumander below, I know to substitute [.!?] in place of \. now, but I'm still having trouble with the other half of my question.
Just to make sure I'm being more clear, using [.!?].*?[.!?]\s will highlight text between punctuation marks, but across multiple lines. So I can't use it to bookmark only the lines that have multiple punctuation marks.
Placing characters inside a pair of square brackets will match to any of the enclosed characters. In your case you'd want [.?!]
If you want to match any sentence that has two of these, then you'll be looking for a pair of [.!?] separated by zero or more of any character.
The regex that matches strings with more than one of the set [.?!] would then be [.!?].*[.!?]
To make . match newlines, you'd add the s modifier to your regex.
...so the full regex would be /[.!?].*[.!?]/s
Ok I figured it out. Thanks to Agumander and Nathan above I substituted [.!?] in for the two \. in my original regex:
\..*?\. became [.!?].*[.!?]
Putting \s at the end of the regex made it select (highlight in pink) the entire document in Notepad++.
The last issue I had was remembering to turn off "matches newline."
Agumander, I think you're asking for a regex that basically finds multiple punctuation marks on a single line. So here's one way to do it.
Here's the text I'm going to match. The regex will match the first line in its entirety, but will not match the second.
Here's a line with multiple punctuation. The entire line will match the regex!
This line does not have multiple punctuation.
Regex
^.*(?:[\.?!].*){2,}$
Explanation
^ -- Start matching at the beginning of a line
.* -- match any character 0 or more times
(?: -- start a new non-capturing group
[.?!] -- find a character matching a period, question mark, or exclamation point.
.* -- match any character 0 or more times
)
{2,} -- repeat the previous group 2 or more times. This is how we ensure there are at least two punctuation marks before considering it a match.
$ -- end of line anchor, basically stop matching at the end of a line
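The pattern above can be checked with a short Python sketch. This assumes Python's re module rather than Notepad++, so re.MULTILINE is needed to make ^ and $ work per line; the backslash before the period inside the character class is dropped because it is not required there.

```python
import re

# The two sample lines from the answer.
text = (
    "Here's a line with multiple punctuation. The entire line will match the regex!\n"
    "This line does not have multiple punctuation."
)

# '.' does not match a newline by default, so each match stays on one line.
pattern = re.compile(r"^.*(?:[.?!].*){2,}$", re.MULTILINE)

matching_lines = pattern.findall(text)
print(matching_lines)
```

Only the first line is returned, because the second line contains a single punctuation mark from the set.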

Find lines with same characters set

I have a situation like this.
Car Driver
Cat Mouse
Door House
Driver Car
I need help with a regex to find all lines with the same set of characters or words, no matter how they are placed in the line.
Car Driver
Driver Car
Edited list:
A0JLS3 Q9NUA2 <
A0JLT2 Q9Y3C7
A0N0L5 P26441
A0N0Q1 O00626
A0N0Q1 P35626
A0PJF8 P27361
Q9NUA2 A0JLS3 <
EDIT: after taking a look at your file, it seems that there is one tab character after the first word and a variable number of tab characters after the second, so you must change the pattern to:
^(\w+)\h+(\w+)\h*$(?=(?>\R.*)*?\R(?:\1\h+\2|\2\h+\1)\h*$)
where \h stands for a horizontal whitespace character.
Since you seem to have huge files and I don't see how to avoid a reluctant quantifier in the lookahead assertion, you can try this modified pattern, where all the quantifiers are possessive (when possible) and all groups are atomic. It seems to be a little faster:
^(\w++)\h++(\w++)\h*+$(?=(?>\R.*+)*?\R(?>\1\h++\2|\2\h++\1)\h*+$)
Previous answer:
You can use this pattern:
^(\w+) (\w+)$(?=(?>\R.*)*?\R(?:\1 \2|\2 \1)$)
This will find lines that have a "duplicate line" with the same two words later in the text. If you want to use it to remove duplicates, keep in mind that this will preserve the last occurrence and remove the first.
pattern details:
^(\w+) (\w+)$ : this describes a whole line (note the anchors for start ^ and end $ of the line) and put each word in a capturing group (group 1 and group 2)
The second part of the pattern checks whether a "similar line" (a line with the same words) appears later. Since it is embedded in a lookahead assertion ((?=...), i.e. "followed by"), this part isn't included in the match result.
(?>\R.*)*?: lines until the duplicate. \R stands for CRLF or LF, and .* matches all characters except newlines. The group is repeated with a lazy quantifier to stop before the first duplicate line. (Note that this works with a greedy quantifier too; the best choice depends on what your document looks like. For example, if duplicates are often at the end of the document, a greedy quantifier is a better choice.)
(?:\1 \2|\2 \1) describes the two possibilities using backreferences to group 1 and 2.
$ is added to ensure that the last word is whole. (otherwise something like A0N0L5 P26441 ... A0N0L5 P26441XXX will succeed)
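The space-separated pattern described above can be tried in Python with a few substitutions. This is a sketch under stated assumptions: Python's re module does not support \R, so \n is used instead (the sample text therefore uses LF line endings), and the atomic group (?>...) is replaced by a plain non-capturing group so the code runs on any Python 3 version.

```python
import re

# The sample list from the question, with LF line endings.
text = "Car Driver\nCat Mouse\nDoor House\nDriver Car"

# Two captured words on a line, followed (in a lookahead) by any number of
# lines and then a line holding the same two words in either order.
pattern = re.compile(
    r"^(\w+) (\w+)$(?=(?:\n.*)*?\n(?:\1 \2|\2 \1)$)",
    re.MULTILINE,
)

earlier_duplicates = [m.group(0) for m in pattern.finditer(text)]
print(earlier_duplicates)
```

As the answer notes, only the earlier occurrence is reported; the last copy has no matching line after it and is therefore not matched.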
I'm not sure exactly what you are trying to achieve. If you're looking for all lines containing both of the words Car and Driver, you can mark all lines containing this regular expression:
Car Driver|Driver Car
Here's a guide on regular expressions in Notepad++: http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions
And consider taking a look at the Stack Overflow Regular Expressions FAQ for some more useful information.

How can I delete everything that is not matched by a pattern

I have a text document in Notepad++ with information separated by line, but want to delete everything but every seventh line. This line is always matched by the pattern (\d{4} :.*?\r\n).
How can I delete everything that does not match this pattern so that I just get every seventh line separated by \r\n?
You could maybe try:
^(?!\d{4} :)[\s\S]*?(?=\r\n\d{4} :)
regex101 demo
[Note, I couldn't put \r in there because I couldn't insert carriage returns in the input box somehow...]
^ is a beginning of line anchor and matches the beginning of a line.
(?!\d{4} :) is a negative lookahead and will make the whole regex match only if there's no \d{4} : at the beginning of the line (the position being indicated by ^).
[\s\S]*? is a character class that will match any and all character. The quantifier is a lazy quantifier that will cause matching to stop as soon as possible (this is determined by what's following)
(?=\r\n\d{4} :) is a positive lookahead, and matches only when there's a \r\n\d{4} : ahead.
If I understood your question correctly, this is what you're looking for. All lines except the 7th lines get deleted, and only one empty line is left behind between each of those 7th lines.
Open the search dialogue and select the Mark tab. In the Find what field enter a search string to find the lines to be kept. Make sure that Bookmark line and Regular expression are selected, then click Mark all. Next visit the menu => Search => Bookmark => Remove unmarked lines.
The question says the lines to be retained match (\d{4} :.*?\r\n). The capture brackets ( and ) are not needed as the capture is not used. Searches for \r\n may often be rewritten as searching for $, ie an end-of-line. Your search pattern is just looking for the first end-of-line after the earlier items. The search may be reduced to \d{4} :.
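Outside Notepad++, the same "keep only the marked lines" idea can be sketched in Python. This is an illustration, not part of either answer: the sample data is invented, and the code simply keeps every line that contains the reduced pattern \d{4} : and rejoins them with \r\n.

```python
import re

# Hypothetical sample data: only the lines containing four digits, a space
# and a colon should survive.
text = (
    "header\r\n"
    "noise\r\n"
    "1234 : first record\r\n"
    "more noise\r\n"
    "5678 : second record\r\n"
)

keep = re.compile(r"\d{4} :")

# splitlines() handles the \r\n line endings; search() mirrors Notepad++'s
# "Bookmark line" behaviour of marking any line that contains a match.
kept_lines = [line for line in text.splitlines() if keep.search(line)]
result = "\r\n".join(kept_lines) + "\r\n"
print(kept_lines)
```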

Matching a line without either of two words

I was wondering how to match a line without either of two words?
For example, I would like to match a line with neither Chapter nor Part. So neither of these two lines is a match:
("Chapter 2 The Economic Problem 31" "#74")
("Part 2 How Markets Work 51" "#94")
while this is a match
("Scatter Diagrams 21" "#64")
My Python-style regex would be something like (?<!(Chapter|Part)).*?\n. I know it is not right and would appreciate your help.
Try this:
^(?!.*(Chapter|Part)).*
@MRAB's solution will work, but here's another option:
(?m)^(?:(?!\b(?:Chapter|Part)\b).)*$
The . matches one character at a time, after the lookahead checks that it's not the first character of Chapter or Part. The word boundaries (\b) make sure it doesn't incorrectly match part of a longer word, like Partition.
The ^ and $ are start- and end anchors; they ensure that you match a whole line. $ is better than \n because it also matches the end of the last line, which won't necessarily have a linefeed at the end. The (?m) at the beginning modifies the meaning of the anchors; without that, they only match at the beginning and end of the whole input, not of individual lines.
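A small Python sketch can show how the two patterns differ. The first three lines are the ones from the question; the "Partition" line is a hypothetical extra added to exercise the word boundaries. The (?m) flag is added to the first pattern here as an assumption, so that both patterns are applied line by line to one multi-line string.

```python
import re

lines = [
    '("Chapter 2 The Economic Problem 31" "#74")',
    '("Part 2 How Markets Work 51" "#94")',
    '("Scatter Diagrams 21" "#64")',
    '("Partition Tables 5" "#10")',  # hypothetical extra line
]
text = "\n".join(lines)

# First answer: reject any line containing Chapter or Part anywhere,
# even as part of a longer word.
simple = re.compile(r"(?m)^(?!.*(Chapter|Part)).*")

# Second answer: reject only lines containing Chapter or Part as whole words.
bounded = re.compile(r"(?m)^(?:(?!\b(?:Chapter|Part)\b).)*$")

simple_matches = [m.group(0) for m in simple.finditer(text)]
bounded_matches = [m.group(0) for m in bounded.finditer(text)]
print(simple_matches)
print(bounded_matches)
```

The first pattern keeps only the Scatter line, while the second also keeps the Partition line because Part is not a whole word there.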