How to search and replace from the last match of a until b? - regex

I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It catches starting from the second line of my example until the end of the quoting environment, I guess the greedy quantifier would catch up to the last \end{quoting} in the file. Is there any possibility of doing this with search and replace, or should I write a macro for this?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)

I think you're trying to match from the last occurrence of \\}% prior to end{quoting}, up to the end{quoting}, in which case you don't really want any character (\_.), you want "any character that isn't \\}%" (yes I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\#!\_.\)\{-} instead of \_.\{-}; this means that the pattern cannot contain multiple \\}% sequences, thus achieving your aims (as far as I can determine them).
This uses a negative zero-width look-ahead pattern \#! to ensure that the next match for any character, is limited to not match the specific text we want to avoid (but other than that, anything else still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\#!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)

You’re on the right track using the non-greedy multi. The Vim help files
state that,
"{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered.
BUT: A match that starts earlier is preferred over a shorter match: "a{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.

Related

Backreference Bug in Notepad++ Non-Consecutive Line Duplication Finder Regex

There seems to be a bug in the Notepad++ find/replace behaviour when using a backreference to find duplicate lines that may not necessarily be consecutive. I'm wondering if anyone knows what the issue could be with the regex or if they know why the regex engine might be bugging out?
Details
I wanted to use a regex to find duplicate lines in Notepad++. The duplicates needn't necessarily be contiguous i.e. on consecutive lines, there can be lines in between. I started with this post:
https://medium.com/#heitorhherzog/compare-sort-and-delete-duplicate-lines-in-notepad-2d1938ed7009
But realised that the regex mentioned there only checks for contiguous duplicates. So I wrote my own regex:
^(.+)$(?:(?:\s|.)+)^(\1)$
The above basically captures something on a whole line, then matches a load of stuff in between, then captures the same thing about on a line.
What's wrong
The regex works, but only sometimes. I can't figure out the pattern. I've whittled it down to this so far. If I do a "Replace All" on the replacement pattern \1\2 then the "replace all" leaves me with just line 3, which is "elative backreferences32". This is wrong:
dasfdasfdsfasdfasdfadsfasdf
elative backreferenceswe
elative backreferences32
elative backreferencesd
elative backreferencdesdfdasdfsdafsd
asfasdfasdfasdfasdfasfdsaasdfas
asdfasdfafds asdfasfdsafasd asdfdasfsd
elative backreferencessfhdfg
x
y
x
But if I delete any line from that file, then only the consecutive lines x then y then x are replaced by a single line xx as I'd expect.
Notes
I'd like to keep this question focused mostly on why the regex is
bugging out. Suggestions about alternative ways to find duplicate
lines are of course good but the main reason I'm asking this is to
figure out what's going on with the regex and Notepad++.
I don't really need the replace part of this, just the find, I was just using the replace to try to figure out what groups were being captured in an attempt to debug this
The find behaviour is also buggy. I noticed this first actually. It first finds the match I'm actually looking for, and then if I click "Find Next" again, it highlights all the text.
Hypotheses
There is a bug in Notepad++ v7.8.4 64 bit. I just updated today so maybe they haven't caught it yet.
Does the in-between part of the match, (?:(?:\s|.)+), maybe cycle
around past the end of file character and loop right back to the
original match? If so, I'd say that's still a bug, because AFAIK a
regex should only consume each character once.
I thought there might be a limit to the number of characters in the file, but I disproved this hypothesis by playing around with the file, adding characters here and there. Two files with the same number of lines and the same number of characters can behave differently: one with buggy behaviour, one without.
Screenshots
Before
After Without Matches Newline (The intended configuration)
After With Matches Newline (for Experimentation)
(?:\s|.) should be avoid as it causes unexpected behaviour, I suggest using [\s\S] instead:
Find what: ^(.+)$[\s\S]+?^(\1)$
Replace with: $1$2

Vim S&R to remove number from end of InstallShield file

I've got a practical application for a vim regex where I'd like to remove numbers from the end of file location links. For example, if the developer is sloppy and just adds files and doesn't reuse file locations, you'll end up with something awful like this:
PATH_TO_MY_FILES&gt
PATH_TO_MY_FILES1&gt
...
PATH_TO_MY_FILES22&gt
PATH_TO_MY_FILES_ELSEWHERE&gt
PATH_TO_MY_FILES_ELSEWHERE1&gt
...
So all I want to do is to S&R and replace PATH_TO_MY_FILES*\d+ with PATH_TO_MY_FILES* using regex. Obviously I am not doing it quite right, so I was hoping someone here could not spoon feed the answer necessarily, but throw a regex buzzword my way to get me on track.
Here's what I have tried:
:%s\(PATH_TO_MY_FILES\w*\)\(\d+\)&gt:gc
But this doesn't work, i.e. if I just do a vim search on that, it doesn't find anything. However, if I use this:
:%s\(PATH_TO_MY_FILES\w*\)\(\d\)&gt:gc
It will match the string, but the grouping is off, as expected. For example, the string PATH_TO_MY_FILES22 will be grouped as (PATH_TO_MY_FILES2)(2), presumably because the \d only matches the 2, and the \w match includes the first 2.
Question 1: Why doesn't \d+ work?
If I go ahead and use the second string (which is wrong), Vim appears to find a match (even though the grouping is wrong), but then does the replacement incorrectly.
For example, given that we know the \d will only match the last number in the string, I would expect PATH_TO_MY_FILES22&gt to get replaced with PATH_TO_MY_FILES2&gt. However, instead it replaces it with this:
PATH_TO_MY_FILES2PATH_TO_MY_FILES22&gtgt
So basically, it looks like it finds PATH_TO_MY_FILES22&gt, but then replaces only the & with group 1, which is PATH_TO_MY_FILES2.
I tried another regex at Regexr.com to see how it would interpret my grouping, and it looked correct, but maybe a hack around my lack of regex understanding:
(PATH_TO_\D*)(\d*)&gt
This correctly broke my target string into the PATH part and the entire number, so I was happy. But then when I used this in Vim, it found the match, but still replaced only the &.
Question 2: Why is Vim only replacing the &?
Answer 1:
You need to escape the + or it will be taken literally. For example \d\+ works correctly.
Answer 2:
An unescaped & in the replacement portion of a substitution means "the entire matched text". You need to escape it if you want a literal ampersand.

RegExp adaption with new line

I've the following RegExp to find the URIs listed above:
"^w{3}\.[\S\-\n|\S]+[^\s.!?,():]+$"
URLs to find:
www.example.org
www.example-example.org
www.example-example.org/product
You'll find it at www.example-
example.org/product.
www.example.org
You'll find it there.
Number 1, 2 and 3 will be found, but 4. delivers "www.example-" as URI.
When there is no point at the end of 4. it would deliver it correct.
EDIT: With deleting ^ and $ only number 5 is not working.
Does anyone can help here?
Your pattern
^w{3}\.[\S\-\n|\S]+[^\s.!?,():]+$
can be simplified to
^w{3}\.[\S\n]+[^\s.!?,():]$
[\S\-\n|\S] this is a character class, no OR possible, no repetition needed, - is included in \S. So [\S\n] is doing the same.
[^\s.!?,():]+ because you match every non whitespace with the expression before this one, here the + is not needed. I assume you just want your pattern not to end with one of the characters from the class.
See your pattern on Regexr (I added \r to your first class, because the line breaks there needs it)
This is a very useful tool to test regexes
I think your problem is that you want to allow line breaks in the link. How do you want to handle this? How do you want to distinguish when the line ends with a link if the word in the next line is just a word or part of the link. I think this is not possible!
The problem is the '^\s' in the second squared bracketed part. Depending on your programming language, '\s' might match the new line. So, you are telling it to match anything that is not a whitespace and it finds a whitespace (new line).
However, this should only be one of your issues. Your regex uses the '^' and '$' characters which mean start and end of line respectively. Try this URL example:
hello from www.example.org
Did it match? I think it will not.

Regular Expression Using the Dot-Matches-All Mode

Normally the . doesn't match newline unless I specify the engine to do so with the (?s) flag. I tried this regexp on my editor's (UltraEdit v14.10) regexp engine using Perl style regexp mode:
(?s).*i
The search text contains multiple lines and each line contains many 'i' characters.
I expect the above regexp means: search as many characters (because with the '?s' the . now matches anything including newline) as possible (because of the greediness for *) until reaching the character 'i'.
This should mean "from the first character to the last 'i' in the last sentence" (greediness should reach the last sentence, right?).
But with UltraEdit's test, it turns out to be "from the first character to the last 'i' in the first sentence that contains an i". Is this result correct? Did I make any wrong interpretation of my reg expression?
e.g. given this text
aaa
bbb
aiaiaiaiaa
bbbicicid
it is
aaa
bbb
aiaiaiai
matched. But I expect:
aaa
bbb
aiaiaiaiaa
bbbicici
Your regex is correct, and so are your expectations of its performance.
This is a long-known bug in UltraEdit's regex implementation which I have written repeatedly to support about. As far as I know, it still hasn't been fixed. The problem appears to lie in the fact that UE's regex implementation is essentially line-based, and additional lines are taken into the match only if necessary. So .* will match greedily on the current line, but it will not cross a newline boundary if it doesn't have to in order to achieve a match.
There are some other subtle bugs with line endings. For example, lookbehind doesn't work across newlines, either.
Write to IDM support, or change to an editor with decent regex support. I did both.
Yes you are right this looks like a bug.
Your interpretation is correct. If you are in Perl mode and not Posix.
However it should apply to posix as well.
Altough defining the modifiers like you do is very rare.
Mostly you provide a string with delimiters and the modifier afterwards like /.*i/s
But this doesn't matter because your way is correct too. And if it wouldnt be supported, it wouldn't match the first newline either.
So yes, this is definately a bug in your program.
You're right that that regex should match the entire string (all 4 lines). My guess is that UltraEdit is attempting to do some sort of optimization by working line by line, and only accumulating new lines "when necessary".

how to eliminate dots from filenames, except for the file extension

I have a bunch of files that look like this:
A.File.With.Dots.Instead.Of.Spaces.Extension
Which I want to transform via a regex into:
A File With Dots Instead Of Spaces.Extension
It has to be in one regex (because I want to use it with Total Commander's batch rename tool).
Help me, regex gurus, you're my only hope.
Edit
Several people suggested two-step solutions. Two steps really make this problem trivial, and I was really hoping to find a one-step solution that would work in TC. I did, BTW, manage to find a one-step solution that works as long as there's an even number of dots in the file name. So I'm still hoping for a silver bullet expression (or a proof/explanation of why one is strictly impossible).
It appears Total Commander's regex library does not support lookaround expressions, so you're probably going to have to replace a number of dots at a time, until there are no dots left. Replace:
([^.]*)\.([^.]*)\.([^.]*)\.([^.]*)$
with
$1 $2 $3.$4
(Repeat the sequence and the number of backreferences for more efficiency. You can go up to $9, which may or may not be enough.)
It doesn't appear there is any way to do it with a single, definitive expression in Total Commander, sorry.
Basically:
/\.(?=.*?\.)//
will do it in pure regex terms. This means, replace any period that is followed by a string of characters (non-greedy) and then a period with nothing. This is a positive lookahead.
In PHP this is done as:
$output = preg_replace('/\.(?=.*?\.)/', '', $input);
Other languages vary but the principle is the same.
Here's one based on your almost-solution:
/\.([^.]*(\.[^.]+$)?)/\1/
This is, roughly, "any dot stuff, minus the dot, and maybe plus another dot stuff at the end of the line." I couldn't quite tell if you wanted the dots removed or turned to spaces - if the latter, change the substitution to " \1" (minus the quotes, of course).
[Edited to change the + to a *, as Helen's below.]
Or substitute all dots with space, then substitute [space][Extension] with .[Extension]
A.File.With.Dots.Instead.Of.Spaces.Extension
to
A File With Dots Instead Of Spaces Extension
to
A File With Dots Instead Of Spaces.Extension
Another pattern to find all dots but the last in a (windows) filename that I've found works for me in Mass File Renamer is:
(?!\.\w*$)\.
I don't know how useful that is to other users, but this page was an early search result and if that had been on here it would have saved me some time.
It excludes the result if it's followed by an uninterrupted sequence of alphanumeric characters leading to the end of the input (filename) but otherwise finds all instances of the dot character.
You can do that with Lookahead. However I don't know which kind of regex support you have.
/\.(?=.*\.)//
Which roughly translates to Any dot /\./ that has something and a dot afterwards. Obviously the last dot is the only one not complying. I leave out the "optionality" of something between dots, because the data looks like something will always be in between and the "optionality" has a performance cost.
Check:
http://www.regular-expressions.info/lookaround.html