vim search regular expression replace with register - regex

I'd like to search a regex pattern with vim and replace the matches with a paste from a register. In detail that means:
acb123acb
asokqwdad
def442ads
asduiosdf
daf567hjk
should finish with
acbXYZacb
asokqwdad
defPOWads
asduiosdf
dafMANhjk
where I had
XYZ
POW
MAN
in a register A (:g/pattern/y A)
A regex pattern to search for might be [0-9]{3} to match the 3 numbers from the text block.
Block mode would help if there were no lines between the matches...
I could of course use a Perl script for this. However, if it is possible in vim, I'm sure it would be a lot faster, right?
Thank you in advance

If you want to replace all strings matching [0-9]{3} with the same value, which happens to be the contents of register a:
:%s/\v\d{3}/\=@a/g
In detail:
:% - apply to all lines in buffer
s/.../.../g - replace all occurrences
\v - what follows is a "very magic" regular expression
\d{3} - match 3 digits
\= - replace with the value of...
@a - register a
If on the other hand you want to read replacement values from register a:
:let a=getreg('a', 1, 1)
:%s/\v\d{3}/\=remove(a, 0)/g
In detail:
let a=getreg('a', 1, 1) - transfer the contents of register a to a list, imaginatively also named a
then same as above, except...
remove(a, 0) - deletes the first element in list a and returns it.
Also, VimL is, sadly, nowhere near as fast as Perl. :)
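For comparison, here is a minimal sketch of the same "take the next replacement from a list" idea outside Vim, in Python. The function name replace_in_order and the list replacements are made up for illustration; the list stands in for the lines of register a. Unlike remove(a, 0), this sketch leaves a match unchanged once the values run out, which is an assumption about the desired behaviour.

```python
import re

text = """acb123acb
asokqwdad
def442ads
asduiosdf
daf567hjk"""

# Replacement values in order; these stand in for the lines of register a.
replacements = ["XYZ", "POW", "MAN"]

def replace_in_order(source, pattern, values):
    """Replace each match of pattern with the next unused value."""
    queue = list(values)  # copy, so the caller's list is not consumed
    # Once the values run out, a match is left unchanged.
    return re.sub(pattern,
                  lambda m: queue.pop(0) if queue else m.group(0),
                  source)

result = replace_in_order(text, r"\d{3}", replacements)
print(result)
```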


Regex to remove double lines ignoring the punctuation marks and spaces in Notepad++

Is it possible to remove duplicates while ignoring punctuation marks and spaces in Notepad++? I would keep one of the matching lines (it doesn't matter which one).
My examples are from the txt file:
Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes
Rough work, iconoclasm, but the only way to get the truth. Oliver Wendell Holmes
Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett
Rule No.1: Never lose money. Rule No.2: Never forget rule No.1. Warren Buffett
Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem
Self-esteem isn't everything it's just that there's nothing without it. Gloria Steinem
You said she's a senior? Babe we're all crazy.
You said, she's a senior! Babe we're ALL crazy.
You said, she's a senior? Babe we're ALL crazy!
Result I need:
Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes
Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett
Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem
You said, she's a senior! Babe we're ALL crazy.
I can delete 100% matching duplicates with regex, but can't find a regex rule to ignore spaces and punctuation marks.
I don't think regex is the best tool for this task, but it's a nice challenge. You can match single words using a nested structure like:
((\w+)\W+((\w+)\W+( ... ((\w+)\W+)? ... )?)?(\w*))
When matching this, capture groups 2 to n contain the words 1 to n-1 of a line. The nested structure is necessary to make it non-ambiguous - otherwise, running the regex takes too long.
To match the duplicate lines, we use a similar structure with back-references:
\1\W+(\2\W+( ... (\9\W+)? ... )?)?
This will also match lines that are substrings of the previous line, which is again helpful to improve performance.
Notice that you have to use the \g{n}-notation when using more than 9 references in Notepad++. Moreover, to avoid matching line breaks you should use [^\w\n\r] instead of \W. To further improve performance, unnecessary groups should be non-capturing, i.e., (?: ... ).
To generate the rather long regex that solves the problem for, e.g., up to 20 words per line, you can use the following script:
MAX_WORDS = 20
punct = "[^\\w\\n\\r]"
backref = (i) => `\\g{${i}}`
patternKeep = (i) => "(\\w+)[^\\w\\n\\r]+" + (i < 0 ? "" : `(?:${patternKeep(i-1)})?`)
patternRemove = (i) => `${backref(MAX_WORDS-i + 2)}(?:${punct}+` + (i < 0 ? "" : patternRemove(i-1)) + ")?"
console.log("^(" + patternKeep(MAX_WORDS) + "(\\w*))(\\r?\\n" + patternRemove(MAX_WORDS)+ `${punct}*${backref(MAX_WORDS+4)}${punct}*)+$`)
When copying this to Notepad++ with settings "Wrap around" on and "Match case" off and replacing with $1, it will remove all duplicate lines in your example.
I doubt that it can be done purely with regular expressions. If it can then I imagine that the expression would be difficult to understand and difficult to maintain. Instead I would suggest a multi-step approach.
Step 1 - modify each line to be: original-line separator original-line.
Step 2 - convert it to be line-without-punctuation separator original-line.
Step 3 - sort the lines
Step 4 - remove duplicated lines
Step 5 - remove line-without-punctuation and separator leaving just the original line.
In more detail:
In all the replaces below: select "Wrap around", unselect "Dot matches newline", unselect "Match whole word only" and unselect "Match case".
Step 1 - choose a separator, some text that is not punctuation and does not occur in the file. Here I use qqq. Do a regular expression replace of ^(.+)$ with \1qqq\1.
Step 2 - remove any punctuation before the separator. Repeatedly do a regular expression replace of [!',-.:?]+(.*qqq) with \1 until no more replacements are made. This expression matches all the punctuation in the example, but you may need to add more for your full text. Multiple spaces also need to be reduced to single spaces, so repeatedly do a regular expression replace of [ ]{2,}(.*qqq) (two or more spaces, then the rest up to the separator) with a single space followed by \1 until no more replacements are made. As a final step, to handle spaces before the qqq, do a regular expression replace of [ ]+qqq (one or more spaces, then qqq) with qqq.
Step 3 - sort the lines lexicographically.
Step 4 - remove duplicated lines. Repeatedly do a regular expression replace of ^(.*qqq).*\R\1 with \1 until no more replacements are made.
Step 5 - Remove unwanted text leaving the original line. Do a regular expression replace of ^.*qqq with nothing (the empty string).
If all punctuation can be deleted, so that the result is a line without punctuation, then you could simply do a regular expression replace of [!',-.:? ]+ with a single space, followed by a sort and finally a remove duplicates.
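If a script is an option, the "normalize, then compare" idea behind the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name dedupe_ignoring_punctuation is made up, the comparison key drops spaces as well as punctuation and ignores case, and the first line of each group is the one that is kept.

```python
import re

def dedupe_ignoring_punctuation(lines):
    """Keep the first line for each distinct sequence of letters and digits."""
    seen = set()
    kept = []
    for line in lines:
        # Comparison key: drop everything except letters and digits, ignore case.
        key = re.sub(r"[\W_]+", "", line).lower()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

sample = [
    "Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes",
    "Rough work, iconoclasm, but the only way to get the truth. Oliver Wendell Holmes",
    "Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett",
    "Rule No.1: Never lose money. Rule No.2: Never forget rule No.1. Warren Buffett",
    "You said she's a senior? Babe we're all crazy.",
    "You said, she's a senior! Babe we're ALL crazy.",
]
result = dedupe_ignoring_punctuation(sample)
for line in result:
    print(line)
```

Because the key is kept in a set, the duplicates do not have to be adjacent and no sort is needed.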
Previously this question attracted an answer, but the author deleted it. To me it was so interesting because a special technique was illustrated. In a comment the answerer pointed me towards another thread to read more about it.
After experimenting a bit with that answer, I came up with the following pattern. Settings in NP++: uncheck [ ] match case and [ ] . matches newline, and replace with the empty string.
^(?>[^\w\n]*(\w++)(?=.*\R(\2?+[^\w\n]*\1\b)))+[^\w\n]*\R(?=\2[^\w\n]*$)
Here is the demo in Regex101 - Assumption is, that duplicate lines are consecutive (like sample).
Most of the used regex-tokens can be looked up in the Stack Overflow Regex FAQ.
In short, the mechanism is to capture words from one line into the first group (\w++) while, inside the lookahead (?=.*\R(\2?+...\1\b)), a second group in the following line "grows" from itself plus the captured word, until at \R(?=\2...$) it either matches all words or fails.
The second group holds the substring of the following line that matches the words and word order of the previous line. At each repetition it expands from (optionally) itself plus a word from the previous line, separated by [^\w\n]*, i.e. any number of characters that are neither word characters nor newlines.
To make this work, matching is done without giving back at crucial points (possessive quantifiers and the atomic group prevent backtracking).

Regex Erasing all except numbers with limited digits

What I want to do is erase everything except \d{4,7} only by replacing.
Any ideas to get this?
ex)
G-A15239L → 15239
(G-A and L should be selected and replaced by empty strings)
now200316stillcovid19asdf → 200316
(now and stillcovid19asdf should be selected and replaced by empty strings)
Also, the replacement text is not limited to the empty string;
substitutions such as $1 are possible too.
I am using regex in the 'Kustom' apps (including KLCK, KLWP, KWGT).
I don't know which engine they use because there is no information about it.
You may use
(\d{4,7})?.?
Or
(\d{4,7})|.
and replace with $1. See the regex demo.
Details
(\d{4,7})? - an optional (due to ? at the end - if it is missing, then the group is obligatory) capturing group matching 1 or 0 occurrences of 4 to 7 digits
| - or
.? - any one char other than line break chars, 1 or 0 times when ? is right after it.
So, any match of 4 to 7 digits is kept (since $1 refers to the Group 1 value) and if there is a char after it, it is removed.
It looks as if the regex engine is Java-based, since all groups that did not take part in the match are replaced with null.
So, the only possible solution is to use a second pass to post-process the results: just replace null with some kind of delimiter, a newline for example.
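To see how the (\d{4,7})|. alternation behaves in an engine that lets you substitute an empty string for an unset group, here is a small Python sketch. It says nothing about the Kustom engine; the function name keep_digit_runs is made up for illustration.

```python
import re

def keep_digit_runs(text):
    """Keep runs of 4 to 7 digits and delete every other character."""
    # Group 1 is unset whenever the '.' branch matched, so fall back to ''.
    return re.sub(r"(\d{4,7})|.", lambda m: m.group(1) or "", text)

print(keep_digit_runs("G-A15239L"))
print(keep_digit_runs("now200316stillcovid19asdf"))
```

The digits of covid19 are removed because a two-digit run cannot satisfy \d{4,7}, so each digit falls through to the . branch.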
Search: .*?(\d{4,7})[^\d]+|.*
Replace: $1
in for instance Notepad++ 6.0 or better (which comes with built-in PCRE support) works with your examples:
jalsdkfilwsehf
now200316stillcovid19asdf
G-A15239L
becomes:
200316
15239

Regex cannot prevent a match of suffix name made up using I,V,X and SR/JR

I am trying to prevent the inclusion of a name suffix, for example JR/SR, or any other suffix made up of I, V, X, using a regular expression. To accomplish this I have implemented the following regex
((^((?!((\b((I+))\b)|(\b(V+)\b)|(\b(X+)\b)|\b(IV)\b|(\b(V?I){1,2}\b)|(\b(IX)\b)|(\bX[I|IX]{1,2}\b)|(\bX|X+[V|VI]{1,2}\b)|(\b(JR)\b)|(\b(SR)\b))).)*$))
Using this I am able to prevent various possible combination eg.,
'Last Name I',
'Last Name II',
'Last Name IJR',
'Last Name SRX' etc.
However, there are still a couple of combinations remaining which this regex matches, e.g. 'Last Name IXV' or 'Last Name VXI'.
I am not able to debug these two. Please suggest which part of this regex I should change to satisfy the requirement.
Thank you!
Try this pattern: .+\b(?:(?>[JS]R)|X|I|J|V)+$
Explanation:
.+ - match one or more of any characters
\b - word boundary
(?:...) - non-capturing group
(?>...) - atomic group
[JS]R - match either S or J followed by R
| - alternation: match what is on the left OR what's on the right
+ - quantifier: match the preceding pattern one or more times
$ - match end of the string
Demo
To solve this I worked on the above regex a little more. Here is the final result, which matches Roman numerals up to thirty (built from I, V, and X).
"(\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b|\b(V|X)\b|\bV[I]{1,2}\b|\b((?!XVV|XVX)X([IXV]{1,2}))\b|\b[S|J]R\b)|^$"
What I have done here is:
I have taken into consideration inputs that are standalone, such as SR or XXV.
I observed the incorrect patterns and excluded them from matching.
Separate inputs are ensured using \b, the word boundary.
Word boundary: \b matches the position between a word character and a
non-word character (or the start or end of the string), so it marks where a word begins or ends.
The incorrect patterns are excluded
using the negative lookahead (?!(IIX|IIV|IVV|IXX|IXI))
How I arrived at this solution is as follows:
I have observed closely all the pattern first, that from I to X - that is:
I
I I
I I I
I V
V
V I
V I I
V I I I (it is out of the range of 3 characters.)
I X
X
We have an I, V, or X in the first position. Then there is another I, X, or V
in the second position, and after that again I or V. I have
implemented this in the following part of the regex above:
\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b
Start with I and then look for any of I, V, or X, zero to three times, and reject the invalid numbers listed inside (?!(IIX|IIV|IVV|IXX|IXI)). I have handled the other combinations below in a similar way.
Then for V and X : \b(V|X)\b
Then for the VI, VII: \bV[I]{1,2}\b
Then for the XI - XXX: \b((?!XVV|XVX)X([IXV]{1,2}))\b
To validate a suffix name, i.e. JR or SR, one can use the following regex: \b[S|J]R\b (note that inside a character class | is a literal character, so [SJ]R is the tighter form)
and the last part (^$) is for matching a blank string, in other words, when no input has been provided in the input box or textbox.
Feel free to post any questions or suggestions.
Thanks!
PS: This regex is simply a solution to validate Roman numerals from 1 to 30 using I, V, and X. I hope it helps regex newcomers learn a bit.
I solved this with a more explicit:
(.+) (?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$))|(.+)
I know I could do something like [JS]R but I like the way this reads:
(.+) match any characters and then a space
(?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$)) atomically match, but don't capture, endings like JR etc.
|(.+) if you don't find the endings then match any characters
Feel free to add the endings you'd like to suit your needs.
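The explicit-list approach translates directly to a script. Below is a minimal Python sketch, not a drop-in replacement for any of the patterns above: the names SUFFIXES, SUFFIX_RE, has_suffix and strip_suffix are made up for illustration, the list is only an example, and a plain non-capturing group is used instead of an atomic one.

```python
import re

# Hypothetical suffix list for illustration; extend it to suit your data.
SUFFIXES = ["JR", "SR", "I", "II", "III", "IV", "MD", "DO", "PHD"]

# One or more spaces, then one of the suffixes, then the end of the string.
SUFFIX_RE = re.compile(r"\s+(?:" + "|".join(map(re.escape, SUFFIXES)) + r")$")

def has_suffix(name):
    """True if the name ends with one of the listed suffixes as its own word."""
    return SUFFIX_RE.search(name) is not None

def strip_suffix(name):
    """Return the name without a trailing suffix, if there is one."""
    return SUFFIX_RE.sub("", name)

print(has_suffix("Last Name JR"))
print(has_suffix("Last Name"))
print(strip_suffix("Last Name III"))
```

Because the list is explicit, strings such as IXV or VXI are simply not suffixes unless you add them.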

How to apply conditional treatment with line.endswith(x) where x is a regex result?

I am trying to apply conditional treatment for lines in a file (symbolised by list values in a list for demonstration purposes below) and would like to use a regex in the endswith(x) method, where x is a range (page-[1-100]).
import re
lines = ['http://test.com','http://test.com/page-1','http://test.com/page-2']
for line in lines:
    # only an exact, literal suffix can be tested this way
    if line.startswith('http') and line.endswith('page-2'):
        print(line)
So the required functionality is that if the value starts with http and ends with a page in the range of 1-100 then it will be returned.
Edit: After reflecting on this, I guess the corollary questions are:
How do I make a regex pattern, i.e. page-[1-100], a variable?
How do I then use this variable, e.g. x, in endswith(x)?
Edit:
This is not an answer to the original question (ie it does not use startswith() and endswith()), and I have no idea if there are problems with this, but this is the solution I used (because it achieved the same functionality):
import re
lines = ['http://test.com','http://test.com/page-1','http://test.com/page-100']
for line in lines:
    # match_beg: the line starts with http://
    match_beg = re.search(r'^http://', line)
    # match_both: it also ends with page-1 .. page-100
    match_both = re.search(r'^http://.*page-(?:[1-9]|[1-9]\d|100)$', line)
    if match_beg and not match_both:
        print(match_beg.group())
    elif match_beg and match_both:
        print(match_both.group())
I don't know python well enough to paste usable code, but as far as the regular expression is concerned, this is rather trivial to do:
page-(?:[2-9]|[1-9]\d|100)$
What this expression will match:
page- is just a fixed string that will be matched 1:1 (case insensitive if you set Options for that).
(?:...) is a non-capturing group that's just used for separating the following branching.
| all act as "either or" with the expressions being to their left/right.
[2-9] will match this numerical range, i.e. 2-9.
[1-9]\d will match any two-digit number (10-99); \d matches any digit.
100 is again a plain and simple match.
$ will match the line end or end of string (again based on settings).
Using this expression you don't use any specific "ends with" functionality (that's given through using $).
Considering this will have to parse the whole string anyway, you may include the "begins with" check as well, which shouldn't cause any additional overhead (at least none you'd notice):
^http://.*page-(?:[2-9]|[1-9]\d|100)$
^ matches the beginning of the line or string (based on settings).
http:// is once again a plain match.
. will match any character.
* is a quantifier "none or more" for the previous expression.
To get you going in the right direction, the Regex that matches your needed range of pages is:
^http.*page-([2-9]|[1-9][0-9]|100)$
this will match lines that start with http and end with page-<2 to 100> inclusive.
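On the corollary questions: endswith() only accepts literal strings (or a tuple of them), but a compiled pattern is an ordinary value that can be stored in a variable and used in its place. A minimal sketch, assuming the 1-100 range from the question; the names PAGE_SUFFIX and is_numbered_page are made up for illustration.

```python
import re

# The pattern is an ordinary value, so it can be stored in a variable.
# The trailing $ plays the role of endswith().
PAGE_SUFFIX = re.compile(r"page-(?:[1-9]|[1-9]\d|100)$")

def is_numbered_page(line):
    """True if line starts with 'http' and ends with page-1 .. page-100."""
    return line.startswith("http") and PAGE_SUFFIX.search(line) is not None

lines = ["http://test.com", "http://test.com/page-1",
         "http://test.com/page-100", "http://test.com/page-101"]
matches = [line for line in lines if is_numbered_page(line)]
print(matches)
```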

Substitute the n-th occurrence of a word in vim

I saw other questions dealing with the finding the n-th occurrence of a word/pattern, but I couldn't find how you would actually substitute the n-th occurrence of a pattern in vim. There's the obvious way of hard coding all the occurrences like
:s/.*\(word\).*\(word\).*\(word\).*/.*\1.*\2.*newWord.*/g
Is there a better way of doing this?
For information,
s/\%(\(pattern\).\{-}\)\{41}\zs\1/2/
also works to replace the 42nd occurrence. However, I prefer the solution given by John Kugelman, which is simpler, even if it does not limit itself to the current line.
You can do this a little more simply by using multiple searches. The empty pattern in the :s/pattern/repl/ command means replace the most recent search result.
:/word//word//word/ s//newWord/
or
:/word//word/ s/word/newWord/
You could then repeat this multiple times by doing @:, or even 10@: to repeat the command 10 more times.
Alternatively, if I were doing this interactively I would do something like:
3/word
:s//newWord/r
That would find the third occurrence of word starting at the cursor and then perform a substitution.
Replace each Nth occurrence of PATTERN in a line with REPLACE.
:%s/\(\zsPATTERN.\{-}\)\{N}/REPLACE/
To replace the nth occurrence of PATTERN in a line in vim: in addition to the above answer, I just wanted to explain the pattern matching, i.e. how it actually works, for easier understanding.
So I will be discussing the \(.\{-}\zsPATTERN\)\{N} solution,
The example I will be using is replacing the second occurrence of more than 1 space in a sentence(string).
Breaking the pattern down:
According to the \zs doc (:help /\zs),
\zs - Matches at any position and sets the start of the match there: the
next character is the first character of the whole match.
.\{-} 0 or more as few as possible (*)
Here . matches any character and \{} gives the number of times.
e.g. ab\{2,3}c will match where b occurs either 2 or 3 times.
Compare this with .*, which is 0 or more, as many as possible.
According to vim non-greedy docs, "{-}" is the same as "*" but uses the shortest match first algorithm.
\{N} -> Matches N of the preceding atom
/\<\d\{4}\> search for exactly 4 digits, same as /\<\d\d\d\d\>
**ignore the \< and \> here; they are for whole-word searching, e.g. \<fred\> will only find fred, not alfred.
\( \) combining the whole pattern.
PATTERN here is the pattern you are matching -> \s\{1,} (\s - whitespace and \{1,} as explained just above, search for 1 or more of them)
"abc substring def"
:%s/\(.\{-}\zs\s\{1,}\)\{2}/,/
OUTPUT -> "abc substring,def"
# explanation: the first space is between abc and substring and the second
# occurrence of the pattern is between substring and def, hence that one
# is replaced by the "," specified in the replace command above.
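Outside Vim, the same "count the matches and only touch the Nth" logic can be written without nesting the pattern. Here is a minimal Python sketch; the function name replace_nth is made up for illustration, and it works on a single string rather than on every line of a buffer.

```python
import re

def replace_nth(text, pattern, replacement, n):
    """Replace only the n-th (1-based) match of pattern in text."""
    count = 0

    def substitute(match):
        nonlocal count
        count += 1
        # Every match passes through here; only the n-th one is changed.
        return replacement if count == n else match.group(0)

    return re.sub(pattern, substitute, text)

print(replace_nth("abc  substring  def", r"\s+", ",", 2))
```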
This answers your actual question, but not your intent.
You asked about replacing the nth occurrence of a word (but seemed to mean "within a line"). Here's an answer for the question as asked, in case someone finds it like I did =)
For weird tasks (like needing to replace every 12th occurrence of "dog" with "parrot"), I like to use recursive recordings.
First blank the recording in @q
qqq
Now start a new recording in q
qq
Next, manually do the thing you want to do (using the example above, replace the 12th occurrence of "dog" with "parrot"):
/dog
nnnnnnnnnnn
delete "dog" and get into insert
diwi
type parrot
parrot
Now play your currently empty "@q" recording
@q
which does nothing.
Finally, stop recording:
q
Now your recording in @q calls itself at the end. But because it calls the recording by name, it won't be empty anymore. So, call the recording:
@q
It will replay the recording, then at the end, as the last step, replay itself again. It will repeat this until the end of the file.
TLDR;
qqq
qq
/dog
nnnnnnnnnnndiwiparrot<esc>
@q
q
@q
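If the file can be processed by a script instead, the "every 12th occurrence" task comes down to counting matches. A minimal Python sketch follows; the function name replace_every_nth is made up for illustration, and the example uses every 3rd occurrence to keep the sample short.

```python
import re

def replace_every_nth(text, pattern, replacement, n):
    """Replace every n-th match of pattern (the n-th, 2n-th, ...) in text."""
    count = 0

    def substitute(match):
        nonlocal count
        count += 1
        return replacement if count % n == 0 else match.group(0)

    return re.sub(pattern, substitute, text)

sample = "dog dog dog dog dog dog"
print(replace_every_nth(sample, r"\bdog\b", "parrot", 3))
```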
Well, if you do /gc then you can count the number of times it asks you for confirmation, and go ahead with the replacement when you get to the nth :D