Eclipse RegEx search for multiple words, ignoring comments - regex

I'm trying to use Eclipse's regex search function to search for the words 'foo' or 'bar', ignoring comments.
This is what I've got so far:
^(?!\s*(//|\*)).*(foo|bar)
The limitations of my comment handling are acceptable to me (though if somebody has a better solution that doesn't dramatically extend the regex, I'd be glad to hear about it):
Single-line comments have to start at the beginning of the line, maybe indented (so I don't care that return null; // foo won't be ignored).
Multi-line comments start at the beginning of the line with a single asterisk, maybe indented (so /* foo won't be ignored, while bar \n * foo will be ignored even though it's not really a comment).
My problem is that the whole line up to (and including) 'foo' or 'bar' is now highlighted in the search results. I only want 'foo' or 'bar' (or both, if both appear in the same line) to be highlighted.
I tried to include a positive look-ahead (in several variants) to achieve this:
^(?!\s*(//|\*))(?=.*)(foo|bar)
This yields no matches at all, and I don't understand why. Thanks for any hint!

The problem is that lookaround assertions don't actually match and consume any text. So in the regex
^(?!\s*(//|\*))(?=.*)(foo|bar)
the texts foo or bar can only be matched at the start of the line because the regex engine hasn't yet moved after matching all the lookaheads.
That means if you don't want the text leading up to foo/bar to be matched, you need a look*behind* assertion instead. However, only .NET and JGSoft regex engines support indefinite quantifiers like the asterisk inside a lookbehind assertion. Java/Eclipse do not support this.
In .NET, you could search for
(?<!^\s*(//|\*).*)(foo|bar)
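Where variable-length lookbehind isn't available, one workaround is to split the job in two: first discard the comment-like lines, then search the remaining lines for the words. Below is a minimal sketch of that idea in Python (not Eclipse's search dialog; Python's re module has the same lookbehind limitation). The names find_words and sample are made up for illustration, and the comment rule is the simplified one from the question.

```python
import re

# A line counts as a "comment" if it starts (after indentation) with // or *.
COMMENT_LINE = re.compile(r"^\s*(//|\*)")
WORD = re.compile(r"foo|bar")
# The lookahead-only attempt from the question, for comparison.
LOOKAHEAD_ONLY = re.compile(r"^(?!\s*(//|\*))(?=.*)(foo|bar)", re.MULTILINE)


def find_words(text):
    """Return (line_number, column, word) for each hit outside comment-like lines."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if COMMENT_LINE.match(line):
            continue  # skip the whole line, as the question's rule does
        for m in WORD.finditer(line):
            hits.append((lineno, m.start(), m.group()))
    return hits


sample = "int foo = bar;\n// foo\n * bar\nfoo();\n"
hits = find_words(sample)
# The lookaheads consume nothing, so foo/bar can only match at column 0.
lookahead_hits = [m.group(0) for m in LOOKAHEAD_ONLY.finditer(sample)]
```

Here hits contains only the words themselves, while lookahead_hits shows why the second pattern in the question finds so little: it only ever matches a word sitting at the very start of a line.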

How to search and replace from the last match of a until b?

I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It matches from the second line of my example through to the end of the quoting environment; I guess the greedy quantifier would match up to the last \end{quoting} in the file. Is there any way to do this with search and replace, or should I write a macro for this?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% prior to \end{quoting}, up to the \end{quoting}, in which case you don't really want any character (\_.), you want "any character that isn't \\}%" (yes, I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\@!\_.\)\{-} instead of \_.\{-}; this means the text matched between the two ends cannot itself contain a \\}% sequence, which achieves your aim (as far as I can determine it).
This uses the negative zero-width look-ahead \@! to ensure that each character matched by \_. is not the start of the specific text we want to avoid (anything else still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\@!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
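The same "any character that doesn't start \\}%" trick can be checked outside Vim. Below is a rough Python sketch, with the caveat that Python's regex syntax differs from Vim's (the look-ahead is written (?!...) here) and that strip_last_break is just an illustrative name. It runs against the sample from the question.

```python
import re

lines = [
    r"\myverse{some text \\",
    r"some more text \\}%",
    "%",
    r"\myverse{again some text \\",
    r"this is my last line \\}%",
    r"\footnote{possibly some footnotes here}%",
    "%",
    r"\end{quoting}",
]
text = "\n".join(lines) + "\n"

# Plain non-greedy version: still starts at the FIRST \\}% it can find.
NAIVE = re.compile(r"\\\\\}%(.*?)\\end\{quoting\}", re.DOTALL)
# Tempered version: each consumed character must not begin another \\}%.
TEMPERED = re.compile(r"\\\\\}%((?:(?!\\\\\}%).)*?)\\end\{quoting\}", re.DOTALL)


def strip_last_break(s):
    """Drop the \\\\ of the last \\\\}% before each \\end{quoting}."""
    return TEMPERED.sub(lambda m: "}%" + m.group(1) + r"\end{quoting}", s)


result = strip_last_break(text)
naive_start = NAIVE.search(text).start()
tempered_start = TEMPERED.search(text).start()
```

The naive pattern starts matching on the second line of the sample, while the tempered one starts on the "last line" entry, so only that line loses its trailing \\.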
You’re on the right track using the non-greedy multi. The Vim help files state that,
"{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered.
BUT: A match that starts earlier is preferred over a shorter match: "a{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.
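The "starts earlier beats shorter" rule is not specific to Vim; backtracking engines in general behave this way. A tiny Python check of the example quoted from the help file, using Python's *? in place of Vim's \{-}:

```python
import re

# Non-greedy a*? followed by b, searched in "xaaab".
m = re.search(r"a*?b", "xaaab")
matched = m.group(0)
start = m.start()
```

The engine tries position 1 before position 4, and a match is possible there, so it returns "aaab" rather than the shorter "b".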

Vim S&R to remove number from end of InstallShield file

I've got a practical application for a vim regex where I'd like to remove numbers from the end of file location links. For example, if the developer is sloppy and just adds files and doesn't reuse file locations, you'll end up with something awful like this:
PATH_TO_MY_FILES&gt
PATH_TO_MY_FILES1&gt
...
PATH_TO_MY_FILES22&gt
PATH_TO_MY_FILES_ELSEWHERE&gt
PATH_TO_MY_FILES_ELSEWHERE1&gt
...
So all I want to do is to S&R and replace PATH_TO_MY_FILES*\d+ with PATH_TO_MY_FILES* using regex. Obviously I am not doing it quite right, so I was hoping someone here could not spoon feed the answer necessarily, but throw a regex buzzword my way to get me on track.
Here's what I have tried:
:%s\(PATH_TO_MY_FILES\w*\)\(\d+\)&gt:gc
But this doesn't work, i.e. if I just do a vim search on that, it doesn't find anything. However, if I use this:
:%s\(PATH_TO_MY_FILES\w*\)\(\d\)&gt:gc
It will match the string, but the grouping is off, as expected. For example, the string PATH_TO_MY_FILES22 will be grouped as (PATH_TO_MY_FILES2)(2), presumably because the \d only matches the 2, and the \w match includes the first 2.
Question 1: Why doesn't \d+ work?
If I go ahead and use the second string (which is wrong), Vim appears to find a match (even though the grouping is wrong), but then does the replacement incorrectly.
For example, given that we know the \d will only match the last number in the string, I would expect PATH_TO_MY_FILES22&gt to get replaced with PATH_TO_MY_FILES2&gt. However, instead it replaces it with this:
PATH_TO_MY_FILES2PATH_TO_MY_FILES22&gtgt
So basically, it looks like it finds PATH_TO_MY_FILES22&gt, but then replaces only the & with group 1, which is PATH_TO_MY_FILES2.
I tried another regex at Regexr.com to see how it would interpret my grouping, and it looked correct, but maybe a hack around my lack of regex understanding:
(PATH_TO_\D*)(\d*)&gt
This correctly broke my target string into the PATH part and the entire number, so I was happy. But then when I used this in Vim, it found the match, but still replaced only the &.
Question 2: Why is Vim only replacing the &?
Answer 1:
You need to escape the + or it will be taken literally. For example \d\+ works correctly.
Answer 2:
An unescaped & in the replacement portion of a substitution means "the entire matched text". You need to escape it if you want a literal ampersand.
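Even with \d\+ escaped correctly, the grouping problem from the question remains, because a greedy \w* also swallows digits and hands back only as much as \d+ strictly needs. A small Python sketch of the effect follows; note that Python's syntax differs from Vim's (+ needs no backslash, and & is not special in the replacement), and strip_trailing_number is a made-up helper name.

```python
import re

# Greedy \w* takes the digits too, then gives back just one for \d+.
GREEDY = re.compile(r"(PATH_TO_MY_FILES\w*)(\d+)&gt")
# Lazy \w*? stops as soon as the rest of the pattern can match.
LAZY = re.compile(r"(PATH_TO_MY_FILES\w*?)(\d+)&gt")


def strip_trailing_number(s):
    """Remove the number that directly precedes &gt, keeping the rest."""
    return LAZY.sub(r"\1&gt", s)


greedy_groups = GREEDY.search("PATH_TO_MY_FILES22&gt").groups()
lazy_groups = LAZY.search("PATH_TO_MY_FILES22&gt").groups()
cleaned = strip_trailing_number("PATH_TO_MY_FILES_ELSEWHERE1&gt")
untouched = strip_trailing_number("PATH_TO_MY_FILES&gt")
```

In Vim the lazy quantifier would be written \w\{-}; this is offered as a pointer to try, not as a tested Vim command.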

Matching line without and with lower-case letters

I want to match two consecutive lines, with the first line having no lower-case letter and the second having lower-case letter(s), e.g.
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
Why would the Regex ^(?!.*[:lower:]).*$\n^(.*[:lower:]).*$ match each of the following two-line examples?
("1.3.3 Disks 24" "#52")
("1.3.4 Tapes 25" "#53")
("1.5.4 Input/Output 41" "#69")
("1.5.5 Protection 42" "#70")
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
Thanks and regards!
ADDED:
For an example such as:
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
How shall I match only the middle two lines not the first three lines or all the four lines?
To use a POSIX "character class" like [:lower:], you have to enclose it in another set of square brackets, like this: [[:lower:]]. (According to POSIX, the outer set of brackets forms a bracket expression and [:lower:] is a character class, but to everyone else the outer brackets define a character class and the inner [:lower:] is obsolete.)
Another problem with your regex is that the first part is not required to consume any characters; everything is optional. That means your match can start on the blank line, and I don't think you want that. Changing the second .* to .+ fixes that, but it's just a quick patch.
This regex seems to match your specification:
^(?!.*[[:lower:]]).+\n(?=.*[[:lower:]]).*$
But I'm a little puzzled, because there's nothing in your sample data that matches. Is there supposed to be?
Using Rubular, we can see what's matched by your initial expression, and then, by adding a few excess capturing groups, see why it matches.
Essentially, the negative look-ahead followed by .* will match anything. If you merely want to check that the first line has no lower-case letters, check that explicitly, e.g.
^(?:[^a-z]+)$
Finally, assuming you want the entire second line, you can do this for the second part:
^(.*?(?=[:lower:]).*?)$
Or to match your initial version:
^(.*?(?=[:lower:])).*?$
The reluctant quantifiers (*?) seemed to be necessary to avoid matching across lines.
The final version I ended up with, thus, is:
^(?:[^a-z]+)$\n^(.*?(?=[:lower:]).*?)$
This can be seen in action with your test data here. It only captures the line ("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205").
Obviously, the regex I've used might be quite specific to Ruby, so testing with your regex engine may be somewhat different. There are many easily Google-able online regex tests, I just picked on Rubular since it does a wonderful job of highlighting what is being matched.
Incidentally, if you're using Python, the Python Regex Tool is very helpful for online testing of Python regexes (and it works with the final version I gave above), though I find the output visually less helpful in trouble-shooting.
After thinking about it a little more, Alan Moore's point about [[:lower:]] is spot on, as is his point about how the data would match. Looking back at what I wrote, I got a little too involved in breaking-down the regex and missed something about the problem as described. If you modify the regex I gave above to:
^(?:[^[:lower:]]+)$\n^(.*?(?=[[:lower:]]).*?)$
It matches only the line ("3.3.1 Paging 187" "#215"), which is the only line with lowercase letters following a line with no lowercase letters, as can be seen here. Placing a capturing group in Alan's expression, yielding ^(?!.*[[:lower:]]).+\n((?=.*[[:lower:]]).*)$ likewise captures the same text, though what, exactly, is matched is different.
I still don't have a good solution for matching multiple lines.
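For what it's worth, the two-lookahead pattern discussed above can be tried against the four-line sample from the question. The sketch below uses Python, which has no POSIX classes in its re module, so [a-z] stands in for [[:lower:]] (an assumption that only holds for ASCII text); the names PAIR and toc are illustrative.

```python
import re

# Line 1: no lowercase letter anywhere; line 2: at least one lowercase letter.
PAIR = re.compile(r"^(?!.*[a-z]).+\n(?=.*[a-z]).*$", re.MULTILINE)

toc = "\n".join([
    '("3.1 NO MEMORY ABSTRACTION 174" "#202")',
    '("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")',
    '("3.3.1 Paging 187" "#215")',
    '("3.3.2 Page Tables 191" "#219")',
])

matches = [m.group(0) for m in PAIR.finditer(toc)]
```

On this input the only match is the middle pair: the first line is rejected because the line after it has no lowercase letters, and the last two lines are rejected because each begins a line that does contain lowercase letters.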

regexIssueTracker not working in CruiseControl.net

I am trying to get an issueUrlBuilder to work in my CruiseControl.NET config, and cannot figure out why it isn't working.
The first one I tried is this:
<cb:define name="issueTracker">
<issueUrlBuilder type="regexIssueTracker">
<find>^.*Issue (\d*).|\n*$</find>
<replace>https://issuetracker/ViewIssue.aspx?ID=$1</replace>
</issueUrlBuilder>
</cb:define>
Then, I reference it in the sourceControl block:
<sourcecontrol type="vaultplugin">
...
<issueTracker/>
</sourcecontrol>
My checkin comments look like this:
[Issue 1234] This is a test comment
I cannot find anywhere in the build reports/logs/etc. where that issue link is converted to a link. Is my regex wrong?
I've also tried the default issueUrlBuilder:
<cb:define name="issueTracker">
<issueUrlBuilder type="defaultIssueTracker">
<url>https://issuetracker/ViewIssue.aspx?ID={0}</url>
</issueUrlBuilder>
</cb:define>
Again, same comments and no links anywhere.
Does anyone have any ideas?
It looks like you're trying to match a potentially multiline comment by using .|\n instead of just ., which doesn't match newlines by default. Your first problem is that | has the lowest precedence of all regex constructs, so it's dividing your whole regex into the alternatives ^.*Issue (\d*). or \n*$. You would need to enclose the alternation in a group: (?:.|\n)*.
Another potential problem is that the lines might be separated by \r\n (carriage-return plus linefeed) instead of just \n. If CCNET uses the .NET regex engine under the hood, that won't be a problem because the dot matches \r. But that's not true of all flavors, and anyway, there's always a better way to match anything including newlines than (?:.|\n)*. I suggest you try
<find>^.*Issue (\d*)(?s:.*)$</find>
or
<find>(?s)^.*Issue (\d*).*$</find>
(?s) and (?s:...) are inline modifiers which allow the dot to match line separator characters.
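The precedence problem and the (?s) fix can both be checked outside CCNET. The sketch below uses Python's re module purely to exercise the patterns; it says nothing about how CCNET itself applies them, and issue_url is a hypothetical helper. Python writes the replacement group as \1 where the config uses $1.

```python
import re

# Pattern from the question: "|" splits it into "^.*Issue (\d*)." OR "\n*$".
ORIGINAL = re.compile(r"^.*Issue (\d*).|\n*$")
# Variant suggested above: (?s) lets "." match newlines as well.
DOTALL = re.compile(r"(?s)^.*Issue (\d*).*$")

URL_TEMPLATE = r"https://issuetracker/ViewIssue.aspx?ID=\1"


def issue_url(comment):
    """Build the issue URL from a check-in comment, or None if no issue is found."""
    m = DOTALL.search(comment)
    return m.expand(URL_TEMPLATE) if m else None


single = "[Issue 1234] This is a test comment"
multi = "Fix the build\n[Issue 1234] more details"

original_match = ORIGINAL.search(single).group(0)
original_multi_group = ORIGINAL.search(multi).group(1)
single_url = issue_url(single)
multi_url = issue_url(multi)
```

With the original pattern, the first alternative stops one character after the number, and on the two-line comment only the empty second alternative matches, so no issue number is captured at all.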
EDIT: It looks like this is a known bug in CCNET. If the inline modifier doesn't work, try replacing . with [\s\S], as you would in a JavaScript regex. Example:
<find>^.*Issue (\d*)[\s\S]*$</find>

regex to find instance of a word or phrase -- except if that word or phrase is in braces

First, a disclaimer. I know a little about regexes but I'm no expert. They seem to be something that I really need twice a year so they just don't stay "on top" of my brain.
The situation: I'd like to write a regex to match a certain word, let's call it "Ostrich". Easy. Except Ostrich can sometimes appear inside of a curly brace. If it's inside of a curly brace it's not a match. The trick here is that there can be spaces inside the curly braces. Also the text is typically inside of a paragraph.
This should match:
I have an Ostrich.
This should not match:
My Emu went to the {Ostrich Race Name}.
This should be a match:
My Ostrich went to the {Ostrich Race Name}.
This should not be a match:
My Emu went to the {Race Ostrich Place}. My Emu went to the {Race Place Ostrich}.
It seems like this is possible with a regex, but I sure don't see it.
I'll offer an alternative solution to doing this, which is a bit more robust (not using regex assertions).
First, remove all the bracketed items, using a regex like {[^}]+} (use replace to change it to an empty string).
Now you can just search for Ostrich (using regex or simple string matching, depending on your needs).
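A minimal Python sketch of this two-step approach, assuming braces are never nested and always balanced; contains_outside_braces is an illustrative name, not part of any library.

```python
import re

# One {...} group with no closing brace inside it.
BRACED = re.compile(r"\{[^}]*\}")


def contains_outside_braces(text, word="Ostrich"):
    """True if `word` appears anywhere outside {...} groups."""
    return word in BRACED.sub("", text)


results = [
    contains_outside_braces("I have an Ostrich."),
    contains_outside_braces("My Emu went to the {Ostrich Race Name}."),
    contains_outside_braces("My Ostrich went to the {Ostrich Race Name}."),
    contains_outside_braces("My Emu went to the {Race Ostrich Place}. My Emu went to the {Race Place Ostrich}."),
]
```

The four calls correspond to the four examples in the question, in order.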
While regular expressions can certainly be written to do what you ask, they're probably not the best tool for this particular type of thing.
One major problem with regular expressions is that they're very good at pattern matching for things that are there, but not so much when you start adding except into the mix.
Regular expressions are not stateful enough to handle this properly without a lot of work, so I would try to find a different path towards a solution.
A character tokenizer that handles the braces would be easy enough to write.
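As a rough illustration of such a tokenizer, here is a short Python sketch that walks the text once and tracks brace depth. The function name and the decision to treat a stray } as harmless are assumptions made for the example, not something specified in the question.

```python
def find_outside_braces(text, word="Ostrich"):
    """Return the start index of every `word` that is not inside {...}."""
    positions = []
    depth = 0  # how many unclosed { we are currently inside
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)  # ignore a stray closing brace
        elif depth == 0 and text.startswith(word, i):
            positions.append(i)
            i += len(word)
            continue
        i += 1
    return positions


first = find_outside_braces("My Ostrich went to the {Ostrich Race Name}.")
second_text = "My {Ostrich} went to the Ostrich Race."
second = find_outside_braces(second_text)
none_found = find_outside_braces("My Emu went to the {Race Place Ostrich}.")
```

Unlike the regex variants, this also copes with nested braces, and the handling of unbalanced braces is explicit in the code rather than a side effect of the pattern.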
I believe this will work, using lookahead and lookbehind assertions:
(?<!{[^}]*)Ostrich(?![^{]*})
I also tested the case My {Ostrich} went to the Ostrich Race. (where the second "Ostrich" does match)
Note that the lookahead assertion (?![^{]*}) is optional, but without it:
My {Ostrich has a missing bracket won't match
My Ostrich also} has a missing bracket will match
which may or may not be desirable.
This works in the .NET regex engine, however, it is not PCRE-compatible because it uses non-fixed-length assertions which are not supported.
Here's a very large regex that almost works.
It will return each "raw" occurrence of the word in a group.
However, the group for the last one will be empty; I'm not sure why. (Tested with .NET.)
Parse it with whitespace in the pattern ignored (in .NET, RegexOptions.IgnorePatternWhitespace):
^(?:
(?:
[^{]
|
(?:\{.*?\})
)*?
(?:\W(Ostrich)\W)?
)*$
Using a positive lookahead with a negation appears to properly match all the test cases as well as multiple Ostriches:
(?<!{[^}]*)Ostrich(?=[^}]*)