vimscript E888: (NFA regexp) cannot repeat - regex

I am attempting to revive the cocoa.vim script.
Error detected while processing function objc#man#ShowDoc:
line 32:
E888: (NFA regexp) cannot repeat
Line 32 of the function objc#man#ShowDoc is:
let attrs = split(matchstr(line, '^ \zs*\S*'), '/')[:2]
First, I don't understand the error. What's repeating? What can't it repeat? Searching for that error online brings me to where it's defined in vim's source code, but it's obtuse enough that I don't understand it.
Second, I find it strange that this regexp used to work, but now it doesn't with newer vim.
I have very little vimscript experience and not much regexp experience. Guidance on where to look from those who do would be much appreciated. Here is the whole src if you're interested.

The problem is that \zs is the start of match regex atom. Repeating it zero or more times doesn't make any sense since you can always start the match at the point the \zs is.
This error was introduced vim patch 7.4.421 to stop a crash in vim when the NFA engine tried to generate the regex. Most likely the star shouldn't even be there. The old regex engine allowed it however I don't believe it did anything meaningful.
You should be able to fix it by just removing the star.
let attrs = split(matchstr(line, '^ \zs\S*'), '/')[:2]
(You might also try adding \%#=1 to the regex to force the old engine. You might want to read :h E888 to see if it says anything useful. I don't have a vim version with this patch level to test right now.)
The help for :h \zs is copied below.
/\zs
\zs Matches at any position, and sets the start of the match there: The
next char is the first char of the whole match. /zero-width
Example:
/^\s*\zsif
matches an "if" at the start of a line, ignoring white space.
Can be used multiple times, the last one encountered in a matching
branch is used. Example:
/\(.\{-}\zsFab\)\{3}
Finds the third occurrence of "Fab".
{not in Vi} {not available when compiled without the +syntax feature}

Related

Backreference Bug in Notepad++ Non-Consecutive Line Duplication Finder Regex

There seems to be a bug in the Notepad++ find/replace behaviour when using a backreference to find duplicate lines that may not necessarily be consecutive. I'm wondering if anyone knows what the issue could be with the regex or if they know why the regex engine might be bugging out?
Details
I wanted to use a regex to find duplicate lines in Notepad++. The duplicates needn't necessarily be contiguous i.e. on consecutive lines, there can be lines in between. I started with this post:
https://medium.com/#heitorhherzog/compare-sort-and-delete-duplicate-lines-in-notepad-2d1938ed7009
But realised that the regex mentioned there only checks for contiguous duplicates. So I wrote my own regex:
^(.+)$(?:(?:\s|.)+)^(\1)$
The above basically captures something on a whole line, then matches a load of stuff in between, then captures the same thing about on a line.
What's wrong
The regex works, but only sometimes. I can't figure out the pattern. I've whittled it down to this so far. If I do a "Replace All" on the replacement pattern \1\2 then the "replace all" leaves me with just line 3, which is "elative backreferences32". This is wrong:
dasfdasfdsfasdfasdfadsfasdf
elative backreferenceswe
elative backreferences32
elative backreferencesd
elative backreferencdesdfdasdfsdafsd
asfasdfasdfasdfasdfasfdsaasdfas
asdfasdfafds asdfasfdsafasd asdfdasfsd
elative backreferencessfhdfg
x
y
x
But if I delete any line from that file, then only the consecutive lines x then y then x are replaced by a single line xx as I'd expect.
Notes
I'd like to keep this question focused mostly on why the regex is
bugging out. Suggestions about alternative ways to find duplicate
lines are of course good but the main reason I'm asking this is to
figure out what's going on with the regex and Notepad++.
I don't really need the replace part of this, just the find, I was just using the replace to try to figure out what groups were being captured in an attempt to debug this
The find behaviour is also buggy. I noticed this first actually. It first finds the match I'm actually looking for, and then if I click "Find Next" again, it highlights all the text.
Hypotheses
There is a bug in Notepad++ v7.8.4 64 bit. I just updated today so maybe they haven't caught it yet.
Does the in-between part of the match, (?:(?:\s|.)+), maybe cycle
around past the end of file character and loop right back to the
original match? If so, I'd say that's still a bug, because AFAIK a
regex should only consume each character once.
I thought there might be a limit to the number of characters in the file, but I disproved this hypothesis by playing around with the file, adding characters here and there. Two files with the same number of lines and the same number of characters can behave differently: one with buggy behaviour, one without.
Screenshots
Before
After Without Matches Newline (The intended configuration)
After With Matches Newline (for Experimentation)
(?:\s|.) should be avoid as it causes unexpected behaviour, I suggest using [\s\S] instead:
Find what: ^(.+)$[\s\S]+?^(\1)$
Replace with: $1$2

How to search and replace from the last match of a until b?

I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It catches starting from the second line of my example until the end of the quoting environment, I guess the greedy quantifier would catch up to the last \end{quoting} in the file. Is there any possibility of doing this with search and replace, or should I write a macro for this?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% prior to end{quoting}, up to the end{quoting}, in which case you don't really want any character (\_.), you want "any character that isn't \\}%" (yes I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\#!\_.\)\{-} instead of \_.\{-}; this means that the pattern cannot contain multiple \\}% sequences, thus achieving your aims (as far as I can determine them).
This uses a negative zero-width look-ahead pattern \#! to ensure that the next match for any character, is limited to not match the specific text we want to avoid (but other than that, anything else still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\#!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
You’re on the right track using the non-greedy multi. The Vim help files
state that,
"{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered.
BUT: A match that starts earlier is preferred over a shorter match: "a{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.

Eclipse RegEx search for multiple words, ignoring comments

I try to use Eclipse's RegEx search function to search for the words 'foo' or 'bar', ignoring comments.
This is what I've got so far:
^(?!\s*(//|\*)).*(foo|bar)
The comment restrictions of my solutions are okay for me (anyway, if somebody has a better solution without dramatically extending the regex, I'd be glad to hear about it):
Single-line comments have to start at the beginning of the line, maybe indented (so I don't care that return null; // foo won't be ignored).
Multi-line comments start at the beginning of the line with a single asterisk, maybe indended (so /* foo won't be ignored, while bar \n * foo will be ignored even though it's not really a comment).
My problem is, that now the whole line up to (and including) 'foo' or 'bar' is highlighted in the search results. I only want 'foo' or 'bar' (or both, if both appear in the same line) to be highlighted.
I tried to include a positive look-ahead (in several variants) to achieve this:
^(?!\s*(//|\*))(?=.*)(foo|bar)
This results in no results. I don't understand why. Thanks for any hint!
The problem is that lookaround assertions don't actually match and consume any text. So in the regex
^(?!\s*(//|\*))(?=.*)(foo|bar)
the texts foo or bar can only be matched at the start of the line because the regex engine hasn't yet moved after matching all the lookaheads.
That means if you don't want the text leading up to foo/bar to be matched, you need a look*behind* assertion instead. However, only .NET and JGSoft regex engines support indefinite quantifiers like the asterisk inside a lookbehind assertion. Java/Eclipse do not support this.
In .NET, you could search for
(?<!^\s*(//|\*).*)(foo|bar)

How many backslashes are required to escape regexps in emacs' "Customize" mode?

I'm trying to use emacs' customize-group packages to tweak some parts of my setup, and I'm stymied. I see things like this in my .emacs file after I make changes with customize:
'(tramp-backup-directory-alist (quote (("\\\\`.*\\\\'" . "~/.emacs.d/autobackups"))))
This was the result of putting the following into the customize text field:
Regexp matching filename: \\`.*\\'
This is a representative sample: I'm actually trying to change several things that want a regexp, and they all show this same problem. How many layers of quoting are there, really? I can't seem to find the magic number of backslashes to get the gosh-dang thing to do what I'm asking it to, even for the simplest regular expressions like .*. Right now, the given customization produces - nothing. It makes no change from emacs' default behavior.
Better yet, where on earth is this documented? It's a little difficult to Google for, but I've been trying quite a few things there as well as in the official documentation and the Emacs wiki. Where is an authoritative source for how many dang backslashes one needs to make a regular expression in customize-mode actually work - or at the very least, to fail with some kind of warning instead of failing silently?
EDIT: As so often happens with questions asked in anger, I was asking the wrong question. Fortunately the answers below, led me to the answer to the question that I needed, which was about quoting rules. I'm going to try to write down what I learned here, because I find the documentation and Googleable resources to be maddeningly obscure about this. So here are the quoting rules I found by trial and error, and I hope that they help someone else, inspire correction, or both.
When an emacs customize-mode buffer asks you for a "Regexp matching filename", it is being, as emacs often is, both terse and idiosyncratic (how often the creator's personality is imparted to the creation!). It means, for one thing, a regexp that will be compared to the whole path of the file in search of a match, not just to the name of the file itself as you might assume from the term "filename". This is the same sense of "filename" used in emacs' buffer-file-name function, for example.
Further, although if you put foo in the field, you'll see "foo" (with double-quotes) written to the actual file, that's not enough quoting and not the right quoting. You need to quote your regexp with the quoting style that, as far as I can tell, only emacs uses: the ``backtick-foo-single-quote'`scheme. And then you need to escape that, making it \`backslash-backtick-foo-backslash-single-quote\' (and if you think that's a headache to input in Markdown, it's more so in emacs).
On top of this, emacs appears to have a rule that the . regexp special character does not match a / at the beginning of filenames, so, as was happening to me above, the classic .* pattern will appear to match nothing: to match "all files", you actually need the regexp /.*, which then you stuff into the quote format of customize-mode to produce \`/.*\', after which customize paints another layer of escaping onto it and writes it to the customization file.
The final result for one of my efforts - a setting such that #autosave# files don't gunk up the directory you're working in, but instead all live in one place:
(custom-set variables
'(auto-save-file-name-transforms (quote (
("\\`/[^/]*:\\([^/]*/\\)*\\([^/]*\\)\\'" "~/.emacs.d/autobackups/\\2" t)
("\\`/.*/\\(.*?\\)\\'" "~/.emacs.d/autobackups/\\1" t)
))))
Backslashes in elisp are a far greater threat to your sanity than parentheses.
EDIT 2: Time for me to be wrong again. I finally found the relevant documentation (through reading another Stack Overflow question, of course!): Regexp Backslash Constructs. The crucial point of confusion for me: the backtick and single quote are not quoting in this context: they're the equivalent of perl's ^ and $ special characters. The backslash-backtick construct matches an empty string anchored at the beginning of the string being checked for a match, and the backslash-single-quote construct matches the empty string at the end of the string-under-consideration. And by "string under consideration," I mean "buffer, which just happens to contain only a file path in this case, but you need to match the whole dang thing if you want a match at all, since this is elisp's global regexp behavior."
Swear to god, it's like dealing with an alien civilization.
EDIT 3: In order to avoid confusing future readers -
\` is the emacs regex for "the beginning of the buffer." (cf Perl's \A)
\' is the emacs regex for "the end of the buffer." (cf Perl's \Z)
^ is the common-idiom regex for "the beginning of the line." It can be used in emacs.
$ is the common-idiom regex for "the end of the line." It can be used in emacs.
Because regex searches across multi-line bodies of text are more common in emacs than elsewhere (e.g. M-x occur), the backtick and single-quote special characters are used in emacs, and as best as I can tell, they're used in the context of customize-mode because if you are considering generic unknown input to a customize-mode field, it could contain newlines, and therefore you want to use the beginning-of-buffer and end-of-buffer special characters because the beginning and end of the input are not guaranteed to be the beginning and end of a line.
I am not sure whether to regret hijacking my own Stack Overflow question and essentially turning it into a blog post.
In the customize field, you'd enter the regexp according to the syntax described here. When customize writes the regexp into a string, any backslashes or double-quote chars in the regexp will be escaped, as per regular string escaping conventions.
So in short, just enter single backslashes in the regexp field, and they'll get correctly doubled up in the resulting custom-set-variables clause written to your .emacs.
Also: since your regexp is for matching filenames, you might try opening up a directory containing files you'd like to match, and then run M-x re-builder RET. You can then enter the regexp in string-escaped format to confirm that it matches those files. By typing % m in a dired buffer, you can enter a regexp in unescaped format (ie. just like in the customize field), and dired will mark matching filenames.

Bug in Mathematica: regular expression applied to very long string

In the following code, if the string s is appended to be something like 10 or 20 thousand characters, the Mathematica kernel seg faults.
s = "This is the first line.
MAGIC_STRING
Everything after this line should get removed.
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
...";
s = StringReplace[s, RegularExpression#"(^|\\n)[^\\n]*MAGIC_STRING(.|\\n)*"->""]
I think this is primarily Mathematica's fault and I've submitted a bug report and will follow up here if I get a response. But I'm also wondering if I'm doing this in a stupid/inefficient way. And even if not, ideas for working around Mathematica's bug would be appreciated.
Mathematica uses PCRE syntax, so it does have the /s aka DOTALL aka Singleline modifier, you just prepend the (?s) modifier before the part of the expression in which you want it to apply.
See the RegularExpression documentation here: (expand the section labeled "More Information")
http://reference.wolfram.com/mathematica/ref/RegularExpression.html
The following set options for all regular expression elements that follow them:
(?i) treat uppercase and lowercase as equivalent (ignore case)
(?m) make ^ and $ match start and end of lines (multiline mode)
(?s) allow . to match newline
(?-c) unset options
This modified input doesn't crash Mathematica 7.0.1 for me (the original did), using a string that is 15,000 characters long, producing the same output as your expression:
s = StringReplace[s,RegularExpression#".*MAGIC_STRING(?s).*"->""]
It should also be a bit faster for the reasons #AlanMoore explained
The best way to optimize the regex depends on the internals of Mathematica's regex engine, but I would definitely get rid of the (.|\\n)*, as #Simon mentioned. It's not just the alternation--although it's almost always a mistake to have an alternation in which every alternative matches exactly one character; that's what character classes are for. But you're also capturing each character when you match it (because of the parentheses), only to throw it away when you match the next character.
A quick scan of the Mathematica regex docs doesn't turn up anything like the /s (Singleline or DOTALL) modifier, so I recommend the old JavaScript standby, [\\s\\S]* -- match anything that is whitespace or anything that isn't whitespace. Also, it might help to add the $ anchor to the end of the regex:
"(^|\\n)[^\\n]*MAGIC_STRING[\\s\\S]*$"
But your best option would probably be not to use regexes at all. I don't see anything here that requires them, and it would probably be much easier as well as more efficient to use Mathematica's normal string-manipulation functions.
Mathematica is a great executive toy but I'd advise against trying to do anything serious with it like regexs over long strings or any kind of computation over significant amounts of data (or where correctness is important). Use something tried and tested. Visual F# 2010 takes 5 milliseconds and one line of code to get the correct answer without crashing:
> let str =
"This is the first line.\nMAGIC_STRING\nEverything after this line should get removed." +
String.replicate 2000 "0123456789";;
val str : string =
"This is the first line.
MAGIC_STRING
Everything after this li"+[20022 chars]
> open System.Text.RegularExpressions;;
> #time;;
--> Timing now on
> (Regex "(^|\\n)[^\\n]*MAGIC_STRING(.|\\n)*").Replace(str, "");;
Real: 00:00:00.005, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0
val it : string = "This is the first line."