How to use backreference to replace part of the string?

.*\t.*\t.*\t.*
I have a 4-column table whose columns are separated by 3 tabs, as above. How can I replace the 2nd and 3rd tabs with commas? I tried to do this in Vim, but failed.

Here's one way to do it:
:%s/\(\t.\{-}\)\@<=\t/,/g
It uses a look-behind match to require an earlier tab character on the line, so it matches every tab except the first and replaces the 2nd, 3rd, 4th, etc. tab characters with commas. See :help /\@<= for help on the look-behind operator.
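For comparison outside Vim, the same result (every tab except the first becomes a comma) can be sketched in Python. Python's re module only accepts fixed-width look-behind, so this sketch does not translate the pattern; it splits at the first tab instead. The function name is made up for illustration.

```python
def commas_after_first_tab(line: str) -> str:
    """Replace every tab except the first one with a comma."""
    # partition() splits at the first tab: (text before, the tab itself, text after).
    head, first_tab, rest = line.partition("\t")
    # Only the part after the first tab has its tabs replaced.
    return head + first_tab + rest.replace("\t", ",")


print(repr(commas_after_first_tab("a\tb\tc\td")))
```

A line without any tab comes back unchanged, because partition() then returns an empty separator and an empty tail.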
Another way, matching only the second and third tab of a line, and only on lines with at least three tab characters, is to use a capture group, referred to in the replacement by the backreference \1, to hold the contents in between those two tabs.
:%s/\t.\{-\}\zs\t\(.\{-}\)\t/,\1,/
This also uses .\{-}, which matches 0 or more characters but is non-greedy (so it tries to match the smallest sequence possible and stays close to the beginning of the line), and the \zs marker, which makes the replacement start at that point in the match (just before the second tab of the line). Again, see Vim's help docs on search patterns for more details on all of those.
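A rough Python equivalent of the backreference approach, for anyone who wants to try the idea outside Vim. This is only a sketch: Python uses unescaped parentheses for groups and has no \zs, so the text before the second tab is captured as group 1 and put back. The names are hypothetical.

```python
import re

# Group 1: first field, first tab, second field (put back unchanged).
# Group 2: the third field, which sits between the 2nd and 3rd tabs.
SECOND_AND_THIRD_TAB = re.compile(r"^([^\t]*\t[^\t]*)\t([^\t]*)\t")


def second_and_third_tabs_to_commas(line: str) -> str:
    """Turn the 2nd and 3rd tab of a line into commas."""
    # Lines with fewer than three tabs do not match and are returned as-is.
    return SECOND_AND_THIRD_TAB.sub(r"\1,\2,", line, count=1)


print(repr(second_and_third_tabs_to_commas("a\tb\tc\td")))
```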

Related

Why doesn't my non-greedy match work in vim?

This is test
There are two tabs (\t) in this line. I want to get rid of the part from the beginning up to and including the first tab, which is "This ", and I used the following pattern:
:s/.\{-}\t//g
It says it can't find the pattern. If I use the following, both tabs are replaced, which isn't what I want. Why doesn't the first pattern work?
:s/.*\t//g
Your first attempt does not work because you are matching the fewest number of any character followed by a tab. The fewest number of any character is zero (0). So both of your tabs match without any other characters.
Based on the comments, the above explanation was incorrect.
Here is one possible solution.
:s/^[^\t]*\t//
This goes from the beginning ^, matching any number of non-tab characters [^\t]* until it reaches a tab \t.
Your pattern /.\{-}\t didn't do what you wanted because of the g flag in the :s command. This flag replaces every match on the line, so once the text up to the first tab is removed, the text up to the second tab matches as well. Just remove the flag and it will work. In addition, when deleting something you can omit the replacement part in :s:
:s/.\{-}\t
The full :s/.\{-}\t// is fine as well. Note that in either case it should not say "pattern not found" as you described. If you see that message, there is something else different between your example and your actual text.
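The effect of the g flag can be reproduced in Python as a sketch, with the count argument of re.sub playing the role of "no g flag". The sample line is an assumption (three words separated by two tabs), and .*? is Python's spelling of Vim's .\{-}.

```python
import re

line = "This\tis\ttest"  # assumed sample: two tabs in the line

# count=1 replaces only the first match, like :s without the g flag.
first_only = re.sub(r".*?\t", "", line, count=1)

# Without count, every match is replaced, like :s with the g flag.
every_match = re.sub(r".*?\t", "", line)

print(repr(first_only))   # 'is\ttest'
print(repr(every_match))  # 'test'
```

With every match replaced, the non-greedy pattern ends up with the same result as the greedy .*\t, which is why the g flag makes the two look alike.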

Notepad++ Regex - Issue with ^ anchor and repeating patterns

When one tries to remove some characters from the start of a line and the anchored pattern can be found again after the first replace, it will be removed again.
For a very simple example, given the input 012345, the search pattern ^. and an empty replacement, Notepad++ will remove the whole line when using Replace All. This is most likely because the cursor is still at the start of the line after the first replacement and thus matches the ^ anchor again.
How can one ensure that only the actual first character is removed (in my case the expected output would be 12345)?
You can see my workaround in my answer, but maybe there is another nice trick to achieve it.
One can match the rest of the line, capture the match into a group and then use this group as replacement. The pattern in the question could be adjusted to ^.(.*) and be replaced by $1.
This will force the cursor to move forward in the string, so the ^ anchor can't match again.
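The capture-and-restore idea can be tried in Python as a sketch. Note the assumption: Python's re.sub does not re-match at the start of the line after a replacement, so the Notepad++ behaviour is not reproduced here; the snippet only shows that ^.(.*) with \1 keeps everything after the first character. The function name is hypothetical.

```python
import re


def drop_first_char(line: str) -> str:
    """Remove the first character by matching the whole line and keeping the rest."""
    # Group 1 holds everything after the first character.
    return re.sub(r"^.(.*)", r"\1", line)


print(drop_first_char("012345"))  # 12345
```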
Another workaround could be finding:
^.(.)?
and replacing it with:
\1
I'm sure this has been the subject of a bug report, but I couldn't find one as of now. In N++:
Anchors are buggy.
With Replace All, replacements are not supposed to be subject to re-matching. But they are when the replacement string is invisible / zero-length.
Take care with them.

Regex to remove the first 2 lines of a text file

I am trying to delete only the first 2 lines of a text file.
I tried using \A.*, but this gets the first line and deletes the rest.
Is there a way to do the inverse?
It is maybe not the most convenient way, but it is possible with Regex:
^.*\n.*\n([\s\S]*)$
With default settings (neither single-line nor multi-line modifiers) the '.' matches everything except a newline. Therefore, .*\n matches one line, including the newline character. Repeat it twice, and we are at the beginning of the third line. Now capture all remaining characters, including newlines ([\s\S] is a nice workaround for the '.' behavior), until the end of the file $.
Then substitute by the first capturing group
\1
and you have everything but the first 2 lines.
How you write the substitution string depends on your regex engine. And depending on the platform or the newline character used in the file, you might need to replace the \n with \r\n or \r, or with one that matches them all (\r\n?|\n).
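As one concrete engine, here is a minimal sketch in Python's re module, where the substitution string is written \1. It assumes the whole file has been read into a single string, and it uses \A and \Z instead of ^ and $ to make "start/end of the text" explicit; the names are made up for illustration.

```python
import re

# Matches \r\n, a lone \r, or \n (non-capturing, so the tail stays group 1).
NEWLINE = r"(?:\r\n?|\n)"

# Two lines with their line breaks, then everything else as group 1.
FIRST_TWO_LINES = re.compile(r"\A.*" + NEWLINE + r".*" + NEWLINE + r"([\s\S]*)\Z")


def drop_first_two_lines(text: str) -> str:
    """Return text without its first two lines."""
    # If the text has fewer than two line breaks, nothing matches
    # and the text is returned unchanged.
    return FIRST_TWO_LINES.sub(r"\1", text, count=1)


print(repr(drop_first_two_lines("one\ntwo\nthree\nfour\n")))
```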

Remove everything before and after variable=int

I'm terrible at regex and need to remove everything from a large portion of text except for a certain variable declaration that occurs numerous times. I'd like to remove everything except for instances of mc_gross=anyint.
Generally we'd need to use "negative lookarounds" to find everything but a specified string. But these are fairly inefficient (although that's probably of little concern to you in this instance), and lookaround is not supported by all regex engines (not sure about Notepad++, and even then it probably depends on the version you're using).
If you're interested in learning about that approach, refer to How to negate specific word in regex?
But regardless, since you are using notepad++, I'd recommend selecting your target, then inverting the selection.
This will select each instance, allowing for optional white space either side of the '=' sign.
mc_gross\s*=\s*\d+
The following answer over on super user explains how to use bookmarks in notepad++ to achieve the "inverse selection":
https://superuser.com/questions/290247/how-to-delete-all-line-except-lines-containing-a-word-i-need
Substitute the regex they're using over there, with the one above.
You could do a regular expression replace of ^.*\b(mc_gross\s*=\s*\d+)\b.*$ with \1. That will remove everything other than the wanted text on each line. Note that on lines where the wanted text occurs two or more times, only one occurrence will be retained. In the search the ^.*\b matches from start-of-line to a word boundary before the wanted text; the \b.*$ matches everything from a word boundary after the wanted text until end of line; the round brackets capture the wanted text for the replacement text. If text such as abcmc_gross=13def should be matched and retained as mc_gross=13 then delete the \bs from the search.
To remove unwanted lines do a regular expression search for ^mc_gross\s*=\s*\d+$ from the Mark tab, tick Bookmark line and click Mark all. Then use Menu => Search => Bookmark => Remove unmarked lines.
Find what: [\s\S]*?(mc_gross=\d+|\Z)
Replace with: \1
Position the cursor at the start of the text then Replace All.
Add word boundaries \b around mc_gross=\d+ if you think it's necessary.
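Outside Notepad++, the "keep only the matches" idea can be sketched in Python by collecting the matches directly instead of deleting everything around them. This is an illustration, not a translation of the Replace All recipe; the function name and sample text are made up.

```python
import re

# \s* allows optional whitespace around "=";
# \b keeps text such as abcmc_gross=13def from matching.
MC_GROSS = re.compile(r"\bmc_gross\s*=\s*\d+\b")


def extract_mc_gross(text: str) -> list:
    """Return every mc_gross=<integer> occurrence, in order of appearance."""
    return MC_GROSS.findall(text)


sample = "x=1&mc_gross=12&y=2\nfoo mc_gross = 7 bar abcmc_gross=13def"
print(extract_mc_gross(sample))  # ['mc_gross=12', 'mc_gross = 7']
```

Unlike the per-line replace with ^.*\b(...)\b.*$, this keeps every occurrence even when a line contains more than one.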

Regex: remove lines not starting with a digit

I have been fighting this problem with the help of a regex cheat sheet, trying to figure out how to do this, but I give up... I have this lengthy file open in Notepad++ and would like to remove all lines that do not start with a digit (0..9). I would use the Find/Replace functionality of N++. I am only mentioning this as I am not sure what regex implementation N++ is using... Thank you
Example. From the following text:
1hello
foo
2world
bar
3!
I would like to extract
1hello
2world
3!
not (with blank lines left behind):
1hello

2world

3!
by doing a find/replace on a regular expression.
You can clear out those lines with ^[^0-9].* but it will leave blank lines.
Notepad++ uses Scintilla, including its regex engine, to match those. As the page linked below puts it:
"\r and \n are never matched because in Scintilla, regular expression searches are made line per line (stripped of end-of-line chars)."
http://www.scintilla.org/SciTERegEx.html
To clear out those blank lines, the only way is to choose extended mode and replace \n\n with \n. If the file uses Windows line endings, replace \r\n\r\n with \r\n.
[^0-9] is a regular expression that matches pretty much anything except digits. If you say ^[^0-9] you "anchor" it to the start of the line, in most regular expression systems. If you want to include the rest of the line, use ^[^0-9].* (with .+ instead of .*, a line consisting of a single non-digit character would not match).
^[^\d].* marks a whole line whose first character is not a digit. Check that there really is no whitespace in front of the digits. Otherwise you'd have to use a different expression.
UPDATE:
You will have to do it in two steps. First empty the lines that do not start with a digit. Then remove the empty lines in extended mode.
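The two steps can be sketched in Python to show why the blank lines appear and how the second pass removes them. This is an illustration in a different engine, not Notepad++ itself; it assumes \n line endings and uses the example text from the question.

```python
import re

text = "1hello\nfoo\n2world\nbar\n3!"  # example from the question, \n endings assumed

# Step 1: empty every line whose first character is not a digit.
# [^0-9\n] rather than [^0-9], so the class cannot match a line break.
emptied = re.sub(r"^[^0-9\n].*", "", text, flags=re.MULTILINE)

# Step 2: collapse each run of line breaks into a single one.
cleaned = re.sub(r"\n{2,}", "\n", emptied)

print(repr(emptied))  # '1hello\n\n2world\n\n3!'
print(repr(cleaned))  # '1hello\n2world\n3!'
```

Using \n{2,} rather than \n\n handles several removed lines in a row in one pass; with a plain \n\n replacement the second step may have to be repeated.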
One could also use the technique of bookmarking in Notepad++. I started benefiting from this feature (long time present but only more recently made somewhat more visible in the UI) not very long ago.
Simply bring up the find dialogue, type regex for lines not starting with digit ^\D.*$ and select Mark All. This will place blue circles, like marbles, in the left gutter - these are line bookmarks. Then just select from main menu Search -> Bookmark -> Remove bookmarked lines.
Bookmarks are cool, you could extract these lines by simply selecting to copy bookmarked lines, opening new document and pasting lines there. I sometimes use this technique when reviewing log files.
I'm not sure what you are asking, but the regex for finding the lines with a digit at the beginning would be
^\d.*
You can keep all the lines that match the above or, alternatively, remove all the lines that match this expression:
^[^\d].*
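If the file can be processed outside Notepad++, keeping the digit lines is a simple filter. A minimal Python sketch, assuming the text is already in a string; the function name is hypothetical.

```python
import re


def keep_digit_lines(text: str) -> list:
    """Return only the lines whose first character is a digit (0-9)."""
    # re.match only tries the start of each line, like an anchored ^[0-9].
    return [line for line in text.splitlines() if re.match(r"[0-9]", line)]


print(keep_digit_lines("1hello\nfoo\n2world\nbar\n3!"))  # ['1hello', '2world', '3!']
```

Because whole lines are dropped rather than emptied, no blank lines are left behind and no second pass is needed.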