Backreference Bug in Notepad++ Non-Consecutive Line Duplication Finder Regex

There seems to be a bug in the Notepad++ find/replace behaviour when using a backreference to find duplicate lines that may not necessarily be consecutive. I'm wondering if anyone knows what the issue could be with the regex or if they know why the regex engine might be bugging out?
Details
I wanted to use a regex to find duplicate lines in Notepad++. The duplicates needn't necessarily be contiguous i.e. on consecutive lines, there can be lines in between. I started with this post:
https://medium.com/#heitorhherzog/compare-sort-and-delete-duplicate-lines-in-notepad-2d1938ed7009
But realised that the regex mentioned there only checks for contiguous duplicates. So I wrote my own regex:
^(.+)$(?:(?:\s|.)+)^(\1)$
The above basically captures something on a whole line, then matches a load of stuff in between, then captures the same thing again on a later line.
What's wrong
The regex works, but only sometimes. I can't figure out the pattern. I've whittled it down to this so far. If I do a "Replace All" on the replacement pattern \1\2 then the "replace all" leaves me with just line 3, which is "elative backreferences32". This is wrong:
dasfdasfdsfasdfasdfadsfasdf
elative backreferenceswe
elative backreferences32
elative backreferencesd
elative backreferencdesdfdasdfsdafsd
asfasdfasdfasdfasdfasfdsaasdfas
asdfasdfafds asdfasfdsafasd asdfdasfsd
elative backreferencessfhdfg
x
y
x
But if I delete any line from that file, then only the consecutive lines x then y then x are replaced by a single line xx as I'd expect.
Notes
I'd like to keep this question focused mostly on why the regex is
bugging out. Suggestions about alternative ways to find duplicate
lines are of course good but the main reason I'm asking this is to
figure out what's going on with the regex and Notepad++.
I don't really need the replace part of this, just the find. I was only using the replace to work out which groups were being captured, in an attempt to debug this.
The find behaviour is also buggy. I noticed this first actually. It first finds the match I'm actually looking for, and then if I click "Find Next" again, it highlights all the text.
Hypotheses
There is a bug in Notepad++ v7.8.4 64 bit. I just updated today so maybe they haven't caught it yet.
Does the in-between part of the match, (?:(?:\s|.)+), maybe cycle
around past the end of file character and loop right back to the
original match? If so, I'd say that's still a bug, because AFAIK a
regex should only consume each character once.
I thought there might be a limit to the number of characters in the file, but I disproved this hypothesis by playing around with the file, adding characters here and there. Two files with the same number of lines and the same number of characters can behave differently: one with buggy behaviour, one without.
Screenshots
Before
After Without Matches Newline (The intended configuration)
After With Matches Newline (for Experimentation)

(?:\s|.) should be avoided as it causes unexpected behaviour. I suggest using [\s\S] instead:
Find what: ^(.+)$[\s\S]+?^(\1)$
Replace with: $1$2
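For anyone who wants to try the pattern outside the editor, here is a minimal sketch in Python. Python's re module is not the engine Notepad++ uses, so it only shows what the pattern is meant to match and says nothing about the Notepad++ behaviour itself. The sample text and the helper name first_duplicate are made up for illustration.

```python
import re

# The answer's pattern; re.MULTILINE makes ^ and $ anchor at every line.
DUPLICATE_LINE = re.compile(r"^(.+)$[\s\S]+?^(\1)$", re.MULTILINE)


def first_duplicate(text):
    """Return the first line that reappears later in text, or None."""
    match = DUPLICATE_LINE.search(text)
    return match.group(1) if match else None


# Hypothetical sample: "x" is repeated with one other line in between.
sample = "alpha\nx\ny\nx\nomega"

print(first_duplicate(sample))
# Python spells the replacement $1$2 as \1\2.
print(DUPLICATE_LINE.sub(r"\1\2", sample))
```

In this sketch the span from the first x to the second x collapses into a single line xx, which is the result the question expected.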

Related

Notepad++ Regex Remove Character from Markdown Formatted Footnote

This is a follow-up question to what was solved yesterday:
Notepad++ Regex Replace Makeshift Footnotes format With Proper Markdown format
I managed to find a Regex to remove the offending semicolons in the main text area but by only cutting out the text and pasting back the result, which can only be done one by one.
I'm not sure how this can be done, but the expert can tell me.
So I have footnote references in markdown format. Two instances of the same thing:
[^1]:
[^2]:
.
.
.
[^99]:
I might not have 99 in a document but I wanted to show I need to match two digits here again.
As I said, there are two instances of these numbered references in the text. One in the main text pointing to the footnote and the footnote at the end of the document.
What I need is to delete the semi-colons from the main text and leave the
[^3]:
[^15]:
etc.
references at the end intact.
Because the main text references come after a word or at the end of a sentence (usually before the sentence-ending period), there is never a case where a reference would start a sentence (even if they seem to appear there once or twice because of word wrap).
I provided the exact opposite of my needs here:
Click here for Regex101 website link
I put in the exact opposite of what I want because I already knew of the
^
sign to match anything that is at the front of the line.
Now I would like to negate this, if possible, so that I would delete the semi-colons in the main text, not down at the bottom.
Of course, it is likely that my approach is not good and you'll come up with a completely different approach. Especially because there doesn't seem to be a NOT operator in Regex, if I read correctly.
I repeat: the Regex101 example with the match and substitution is exactly the opposite of what I want.
I am not sure if you can play around in the substitution line to get the desired negative effect.
I could have probably asked for removing the first occurrence of semi-colons but I thought the important part of tackling the problem is that those items not to be matched are always at the start of the line, not the others.
Thanks for any suggestions
In Notepad++ you might use a negative lookbehind asserting that the start of the line is not directly to the left, and use \K to clear the match buffer, so that only the colon is matched and can be replaced by an empty string.
(?<!^)\[\^\d{1,2}]\K:
Explanation
(?<!^) Negative lookbehind, assert that the start of the line is not directly to the left
\[\^ Match [^
\d{1,2} Match 1 or 2 digits
] Match literally
\K Forget what is matched so far
: Match a colon
Regex demo
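If you want to check the idea outside Notepad++, here is a rough sketch in Python. Python's built-in re module has no \K, so this version swaps it for a capture group that is written back in the replacement. The sample text and the function name strip_inline_colons are assumptions made for the example.

```python
import re

# Same idea as the Notepad++ pattern, with \K replaced by a capture group.
# re.MULTILINE lets ^ inside the lookbehind stand for the start of any line.
INLINE_REF = re.compile(r"(?<!^)(\[\^\d{1,2}\]):", re.MULTILINE)


def strip_inline_colons(text):
    """Drop the colon after footnote references that do not start a line."""
    return INLINE_REF.sub(r"\1", text)


# Hypothetical document: two inline references, two footnote definitions.
doc = "Some claim[^1]: and another[^12]:.\n\n[^1]: First note\n[^12]: Second note"

print(strip_inline_colons(doc))
```

The definitions at the start of a line keep their colon; only the references inside the running text lose it.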

RegEx for underlining text

How can I match one line of text with a regex and follow it up with a line of dashes exactly as many as characters in the initial match to achieve text-only underlining. I intend to use this with the search and replace function (likely in the scope of a macro) inside an editor. Probably, but not necessarily, Visual Studio Code.
This is a heading
should turn into
This is a heading
-----------------
I believe I have read an example for that years ago but can't find it; neither do I seem to be able to formulate a search query to get anything useful out of Google (including variations of the question's title). If you are I'd be interested in that, too.
The best I can come up with is this:
^(.)(?=(.*\n?))|.
Substitution
$1$2-
Syntax and notes:
^(.) match the first character of a line, capture it in group 1
(?=(.*\n?)) then look ahead for the rest of this line and capture it in group 2, including a line break if there's any
|. or a normal character
But the text must have a line break after it, otherwise the underline stays on the same line.
Not sure if it is of any use, but here are the test cases.
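The substitution can be tried in Python as a quick sanity check. This is only a sketch: other engines, including the one in Visual Studio Code, may treat unmatched groups or line anchors differently. The function name underline is made up for the example.

```python
import re

# The pattern from above; re.MULTILINE lets ^ match at the start of each line.
UNDERLINE = re.compile(r"^(.)(?=(.*\n?))|.", re.MULTILINE)


def underline(text):
    """Apply the substitution $1$2- (written \\1\\2- in Python)."""
    # Groups that did not take part in a match expand to the empty string
    # in Python 3.5 and later, so every later character becomes one dash.
    return UNDERLINE.sub(r"\1\2-", text)


print(underline("This is a heading\n"))
# Without a trailing line break the dashes end up on the same line.
print(underline("Hi"))
```

The first character carries the whole line plus one dash, and every remaining character turns into one more dash, which is why the underline ends up exactly as long as the heading.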

How to search and replace from the last match of a until b?

I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It catches starting from the second line of my example until the end of the quoting environment, I guess the greedy quantifier would catch up to the last \end{quoting} in the file. Is there any possibility of doing this with search and replace, or should I write a macro for this?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% prior to end{quoting}, up to the end{quoting}, in which case you don't really want any character (\_.), you want "any character that isn't \\}%" (yes I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\#!\_.\)\{-} instead of \_.\{-}; this means that the pattern cannot contain multiple \\}% sequences, thus achieving your aims (as far as I can determine them).
This uses a negative zero-width look-ahead pattern \#! to ensure that the next match for any character, is limited to not match the specific text we want to avoid (but other than that, anything else still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\#!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
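The same "any character, as long as \\}% does not start here" idea can be sketched in Python, where the negative look-ahead is spelled (?!...) instead of \#!. This is not Vim syntax, just an illustration of the technique. The sample is taken from the question, and the names LAST_BREAK and drop_last_break are made up.

```python
import re

LAST_BREAK = re.compile(
    r"\\\\\}%"                      # the literal text \\}%
    r"((?:(?!\\\\\}%)[\s\S])*?)"    # any characters, but never another \\}%
    r"\\end\{quoting\}"             # the literal text \end{quoting}
)


def drop_last_break(text):
    """Remove the \\\\ of the last \\\\}% before each \\end{quoting}."""
    return LAST_BREAK.sub(
        lambda m: "}%" + m.group(1) + r"\end{quoting}", text
    )


sample = r"""\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}"""

fixed = drop_last_break(sample)
print(fixed)
```

An attempt that starts at the first \\}% fails as soon as it reaches the second one, so only the occurrence closest to \end{quoting} is rewritten.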
You’re on the right track using the non-greedy multi. The Vim help files
state that,
"{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered.
BUT: A match that starts earlier is preferred over a shorter match: "a{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.
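The "earlier start beats shorter match" rule is not specific to Vim. A small sketch with Python's lazy quantifier *? (used here as a stand-in for Vim's \{-}) behaves the same way on the example from the help text:

```python
import re

# a*? is lazy, but the search still reports the leftmost possible match,
# so it starts at the first "a" rather than giving the shorter "ab" or "b".
match = re.search(r"a*?b", "xaaab")

print(match.group())  # aaab
```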

Vim S&R to remove number from end of InstallShield file

I've got a practical application for a vim regex where I'd like to remove numbers from the end of file location links. For example, if the developer is sloppy and just adds files and doesn't reuse file locations, you'll end up with something awful like this:
PATH_TO_MY_FILES&gt
PATH_TO_MY_FILES1&gt
...
PATH_TO_MY_FILES22&gt
PATH_TO_MY_FILES_ELSEWHERE&gt
PATH_TO_MY_FILES_ELSEWHERE1&gt
...
So all I want to do is to S&R and replace PATH_TO_MY_FILES*\d+ with PATH_TO_MY_FILES* using regex. Obviously I am not doing it quite right, so I was hoping someone here could not spoon feed the answer necessarily, but throw a regex buzzword my way to get me on track.
Here's what I have tried:
:%s\(PATH_TO_MY_FILES\w*\)\(\d+\)&gt:gc
But this doesn't work, i.e. if I just do a vim search on that, it doesn't find anything. However, if I use this:
:%s\(PATH_TO_MY_FILES\w*\)\(\d\)&gt:gc
It will match the string, but the grouping is off, as expected. For example, the string PATH_TO_MY_FILES22 will be grouped as (PATH_TO_MY_FILES2)(2), presumably because the \d only matches the 2, and the \w match includes the first 2.
Question 1: Why doesn't \d+ work?
If I go ahead and use the second string (which is wrong), Vim appears to find a match (even though the grouping is wrong), but then does the replacement incorrectly.
For example, given that we know the \d will only match the last number in the string, I would expect PATH_TO_MY_FILES22&gt to get replaced with PATH_TO_MY_FILES2&gt. However, instead it replaces it with this:
PATH_TO_MY_FILES2PATH_TO_MY_FILES22&gtgt
So basically, it looks like it finds PATH_TO_MY_FILES22&gt, but then replaces only the & with group 1, which is PATH_TO_MY_FILES2.
I tried another regex at Regexr.com to see how it would interpret my grouping, and it looked correct, but maybe a hack around my lack of regex understanding:
(PATH_TO_\D*)(\d*)&gt
This correctly broke my target string into the PATH part and the entire number, so I was happy. But then when I used this in Vim, it found the match, but still replaced only the &.
Question 2: Why is Vim only replacing the &?
Answer 1:
You need to escape the + or it will be taken literally. For example \d\+ works correctly.
Answer 2:
An unescaped & in the replacement portion of a substitution means "the entire matched text". You need to escape it if you want a literal ampersand.
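To see the grouping problem from the question in isolation, here is a hedged sketch in Python. Note the differences from Vim: in Python's re the + is a quantifier without escaping, and & has no special meaning in the replacement. The lazy \w*? is one possible way to keep the digits out of group 1; it is not something the answers above proposed.

```python
import re

# Greedy \w* also swallows digits, so \d+ is left with only the last one.
GREEDY = re.compile(r"(PATH_TO_MY_FILES\w*)(\d+)&gt")
# Lazy \w*? gives up characters as early as possible, so \d+ takes all digits.
LAZY = re.compile(r"(PATH_TO_MY_FILES\w*?)(\d+)&gt")


def strip_trailing_number(text):
    """Drop the digits directly before &gt (illustrative helper)."""
    return LAZY.sub(r"\1&gt", text)


print(GREEDY.search("PATH_TO_MY_FILES22&gt").groups())
print(LAZY.search("PATH_TO_MY_FILES22&gt").groups())
print(strip_trailing_number("PATH_TO_MY_FILES_ELSEWHERE1&gt"))
```

Lines without a trailing number, such as PATH_TO_MY_FILES&gt, do not match and are left alone.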

Matching line without and with lower-case letters

I want to match two consecutive lines, with the first line having no lower-case letter and the second having lower-case letter(s), e.g.
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
Why would the Regex ^(?!.*[:lower:]).*$\n^(.*[:lower:]).*$ match each of the following two-line examples?
("1.3.3 Disks 24" "#52")
("1.3.4 Tapes 25" "#53")
("1.5.4 Input/Output 41" "#69")
("1.5.5 Protection 42" "#70")
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
Thanks and regards!
ADDED:
For an example such as:
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
How shall I match only the middle two lines not the first three lines or all the four lines?
To use a POSIX "character class" like [:lower:], you have to enclose it in another set of square brackets, like this: [[:lower:]]. (According to POSIX, the outer set of brackets form a bracket expression and [:lower:] is a character class, but to everyone else the outer brackets define a character class and the inner [:lower:] is obsolete.)
Another problem with your regex is that the first part is not required to consume any characters; everything is optional. That means your match can start on the blank line, and I don't think you want that. Changing the second .* to .+ fixes that, but it's just a quick patch.
This regex seems to match your specification:
^(?!.*[[:lower:]]).+\n(?=.*[[:lower:]]).*$
But I'm a little puzzled, because there's nothing in your sample data that matches. Is there supposed to be?
Using Rubular, we can see what's matched by your initial expression, and then, by adding a few excess capturing groups, see why it matches.
Essentially, the negative look-ahead followed by .* will match anything. If you merely want to check that the first line has no lower-case letters, check that explicitly, e.g.
^(?:[^a-z]+)$
Finally, assuming you want the entire second line, you can do this for the second part:
^(.*?(?=[:lower:]).*?)$
Or to match your initial version:
^(.*?(?=[:lower:])).*?$
The reluctant quantifiers (*?) seemed to be necessary to avoid matching across lines.
The final version I ended up with, thus, is:
^(?:[^a-z]+)$\n^(.*?(?=[:lower:]).*?)$
This can be seen in action with your test data here. It only captures the line ("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205").
Obviously, the regex I've used might be quite specific to Ruby, so testing with your regex engine may be somewhat different. There are many easily Google-able online regex testers; I just picked Rubular since it does a wonderful job of highlighting what is being matched.
Incidentally, if you're using Python, the Python Regex Tool is very helpful for online testing of Python regexes (and it works with the final version I gave above), though I find the output visually less helpful in trouble-shooting.
After thinking about it a little more, Alan Moore's point about [[:lower:]] is spot on, as is his point about how the data would match. Looking back at what I wrote, I got a little too involved in breaking-down the regex and missed something about the problem as described. If you modify the regex I gave above to:
^(?:[^[:lower:]]+)$\n^(.*?(?=[[:lower:]]).*?)$
It matches only the line ("3.3.1 Paging 187" "#215"), which is the only line with lowercase letters following a line with no lowercase letters, as can be seen here. Placing a capturing group in Alan's expression, yielding ^(?!.*[[:lower:]]).+\n((?=.*[[:lower:]]).*)$ likewise captures the same text, though what, exactly, is matched is different.
I still don't have a good solution for matching multiple lines.
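For engines without POSIX classes, Alan's expression can be approximated by writing [a-z] in place of [[:lower:]]. The sketch below does this with Python's re module, which does not support [[:lower:]]. It assumes ASCII-only text, and the function name upper_then_lower_pairs is made up. On the four-line sample from the question it picks out the middle two lines as one match.

```python
import re

# Alan's expression with [a-z] standing in for [[:lower:]] (ASCII only).
# re.MULTILINE makes ^ and $ anchor at every line.
PAIR = re.compile(r"^(?!.*[a-z]).+\n(?=.*[a-z]).*$", re.MULTILINE)


def upper_then_lower_pairs(text):
    """Return [first_line, second_line] for every matching pair of lines."""
    return [m.group(0).split("\n") for m in PAIR.finditer(text)]


sample = "\n".join([
    '("3.1 NO MEMORY ABSTRACTION 174" "#202")',
    '("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")',
    '("3.3.1 Paging 187" "#215")',
    '("3.3.2 Page Tables 191" "#219")',
])

for first, second in upper_then_lower_pairs(sample):
    print(first)
    print(second)
```

The first line is rejected as a starting point because the line after it contains no lower-case letter, and the last two lines are rejected because each of them does.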