Why doesn't my non-greedy match work in vim? - regex

This is test
There are two tabs (\t) in this line. I want to get rid of the part from the beginning to the first tab key, which is "This ", and I used the following pattern:
:s/.\{-}\t//g
It says it can't find the pattern. If I use the following, both tabs are replaced, which isn't what I want. Why doesn't the first pattern work?
:s/.*\t//g

Your first attempt does not work because you are matching the fewest number of any character followed by a tab. The fewest number of any character is zero (0). So both of your tabs match without any other characters.
Based on the comments, the above explanation was incorrect.
Here is one possible solution.
:s/^[^\t]*\t//
This goes from the beginning ^, capturing any number of non-tab characters [^\t]* until it reaches a tab \t.

Your pattern /.\{-}\t didn't work because of the g flag in the :s command. This flag enables global matching so it matches twice. Just remove the flag and it will work. In addition, when deleting something you can omit the replacement part in :s:
:s/.\{-}\t
The full :s/.\{-}\t// is fine as well. Note that in either case it should not say "pattern not found" as you described. If you see that message, there is something else different between your example and your actual text.

Related

Notepad++ Regex - Issue with ^ anchor and repeating patterns

When one tries to remove some characters from the start of a line and the anchored pattern can be found again after the first replace, it will be removed again.
For a very simple example given the input 012345, search pattern ^. and empty replacement, Notepad++ will remove the whole line when using replace all. This is most likely due to the case, that the cursor is still at the start of the line after the first replace and thus matches the ^ anchor again.
How can one ensure that only the actual first character is removed (in my case the expected output would be 12345)?
You can see my workaround in my answer, but maybe there is another nice trick to achieve it.
One can match the rest of the line, capture the match into a group and then use this group as replacement. The pattern in the question could be adjusted to ^.(.*) and be replaced by $1.
This will force the cursor to move forward in the string, so the ^ anchor can't match again.
Another workaround could be finding:
^.(.)?
and replacing it with:
\1
I'm sure this is a subject of a bug report but couldn't find it as of now. In N++:
Anchors are buggy
By Replace All functionality, replacements are supposed to not be a subject to re-matching. But they are, when replacement strings are invisible / zero-length characters.
Take care of them.

Remove everything before and after variable=int

I'm terrible at regex and need to remove everything from a large portion of text except for a certain variable declaration that occurs numerous times, id like to remove everything except for instances of mc_gross=anyint.
Generally we'd need to use "negative lookarounds" to find everything but a specified string. But these are fairly inefficient (although that's probably of little concern to you in this instance), and lookaround is not supported by all regex engines (not sure about notepad++, and even then probably depends on the version you're using).
If you're interested in learning about that approach, refer to How to negate specific word in regex?
But regardless, since you are using notepad++, I'd recommend selecting your target, then inverting the selection.
This will select each instance, allowing for optional white space either side of the '=' sign.
mc_gross\s*=\s*\d+
The following answer over on super user explains how to use bookmarks in notepad++ to achieve the "inverse selection":
https://superuser.com/questions/290247/how-to-delete-all-line-except-lines-containing-a-word-i-need
Substitute the regex they're using over there, with the one above.
You could do a regular expression replace of ^.*\b(mc_gross\s*=\s*\d+)\b.*$ with \1. That will remove everything other than the wanted text on each line. Note that on lines where the wanted text occurs two or more times, only one occurrence will be retained. In the search the ^.*\b matches from start-of-line to a word boundary before the wanted text; the \b.*$ matches everything from a word boundary after the wanted text until end of line; the round brackets capture the wanted text for the replacement text. If text such as abcmc_gross=13def should be matched and retained as mc_gross=13 then delete the \bs from the search.
To remove unwanted lines do a regular expression search for ^mc_gross\s*=\s*\d+$ from the Mark tab, tick Bookmark line and click Mark all. Then use Menu => Search => Bookmark => Remove unmarked lines.
Find what: [\s\S]*?(mc_gross=\d+|\Z)
Replace with: \1
Position the cursor at the start of the text then Replace All.
Add word boundaries \b around mc_gross=\d+ if you think it's necessary.

Regex negation in vim

In vim I would like to use regex to highlight each line that ends with a letter, that is preceeded by neither // nor :. I tried the following
syn match systemverilogNoSemi "\(.*\(//\|:\).*\)\#!\&.*[a-zA-Z0-9_]$" oneline
This worked very good on comments, but did not work on lines containing colon.
Any idea why?
Because with this regex vim can choose any point for starting match for your regular expression. Obviously it chooses the point where first concat matches (i.e. does not have // or :). These things are normally done by using either
\v^%(%(\/\/|\:)#!.)*\w$
(removed first concat and the branch itself, changed .* to %(%(\/\/|\:)#!.)*; replaced collection with equivalent \w; added anchor pointing to the start of line): if you need to match the whole line. Or negative look-behind if you need to match only the last character. You can also just add anchor to the first concat of your variant (you should remove trailing .* from the first concat as it is useless, and the branch symbol for the same reason).
Note: I have no idea why your regex worked for comments. It does not work with comments the way you need it in all cases I checked.
does this work for you?
^\(\(//\|:\)\#<!.\)*[a-zA-Z0-9_]$

Regex Searching in vim

I'm using vim to do some pattern matching on a text file. I've enabled search highlighting so that I know exactly what is getting matched on each search and am getting confused.
Consider searching for [a-z]* on the following text:123456789abcdefghijklmnopqrstuvwxyxz987654321ABCDEFGHIJKLMNOPQRSTUVWQXZ
I expected this search to match zero or more consecutive characters that are in the range [a-z]. Instead, I get a match on the entire line.
Should this be the expected behaviour?
Thanks,
Andrew
It's matching the empty strings that occur after every character. It has no way of highlighting empty ranges, so it looks like everything is highlighted.
Try searching for [a-z]\+ instead.
Empty string matches [a-z]*... therefore this thing is matching everywhere. Perhaps you want to cut down some of the cases by doing [a-z]+ (1 or more), or [a-z]{4,} (4 or more).
You're not getting a match on the entire line, you're getting a match on every character. Your pattern also matches nothing at all, which is matched by every single character.

Explain this Regular Expression please

Regular Expressions are a complete void for me.
I'm dealing with one right now in TextMate that does what I want it to do...but I don't know WHY it does what I want it to do.
/[[:alpha:]]+|( )/(?1::$0)/g
This is used in a TextMate snippet and what it does is takes a Label and outputs it as an id name. So if I type "First Name" in the first spot, this outputs "FirstName".
Previously it looked like this:
/[[:alpha:]]+|( )/(?1:_:/L$0)/g (it might have been \L instead)
This would turn "First Name" into "first_name".
So I get that the underscore adds an underscore for a space, and that the /L lowercases everything...but I can't figure out what the rest of it does or why.
Someone care to explain it piece by piece?
EDIT
Here is the actual snippet in question:
<column header="$1"><xmod:field name="${2:${1/[[:alpha:]]+|( )/(?1::$0)/g}}"/></column>
This regular expression (regex) format is basically:
/matchthis/replacewiththis/settings
The "g" setting at the end means do a global replace, rather than just restricting the regex to a particular line or selection.
Breaking it down further...
[[:alpha:]]+|( )
That matches an alpha numeric character (held in parameter $0), or optionally a space (held in matching parameter $1).
(?1::$0)
As Roger says, the ? indicates this part is a conditional. If a match was found in parameter $1 then it is replaced with the stuff between the colons :: - in this case nothing. If nothing is in $1 then the match is replaced with the contents of $0, i.e. any alphanumeric character that is not a space is output unchanged.
This explains why the spaces are removed in the first example, and the spaces get replaced with underscores in your second example.
In the second expression the \L is used to lowercase the text.
The extra question in the comment was how to run this expression outside of TextMate. Using vi as an example, I would break it into multiple steps:
:0,$s/ //g
:0,$s/\u/\L\0/g
The first part of the above commands tells vi to run a substitution starting on line 0 and ending at the end of the file (that's what $ means).
The rest of the expression uses the same sorts of rules as explained above, although some of the notation in vi is a bit custom - see this reference webpage.
I find RegexBuddy a good tool for me in dealing with regexs. I pasted your 1st regex in to Buddy and I got the explanation shown in the bottom frame:
I use it for helping to understand existing regexs, building my own, testing regexs against strings, etc. I've become better # regexs because of it. FYI I'm running under Wine on Ubuntu.
it's searching for any alpha character that appears at least once in a row [[:alpha:]]+ or space ( ).
/[[:alpha:]]+|( )/(?1::$0)/g
The (?1 is a conditional and used to strip the match if group 1 (a single space) was matched, or replace the match with $0 if group 1 wasn't matched. As $0 is the entire match, it gets replaced with itself in that case. This regex is the same as:
/ //g
I.e. remove all spaces.
/[[:alpha:]]+|( )/(?1:_:/\L$0)/g
This regex is still using the same condition, except now if group 1 was matched, it's replaced with an underscore, and otherwise the full match ($0) is used, modified by \L. \L changes the case of all text that comes after it, so \LABC would result in abc; think of it as a special control code.