Notepad++ Regex - Issue with ^ anchor and repeating patterns

When you try to remove some characters from the start of a line and the anchored pattern still matches after the first replacement, the new match is removed as well.
As a very simple example: given the input 012345, the search pattern ^. and an empty replacement, Notepad++ removes the whole line when using Replace All. This is most likely because the cursor is still at the start of the line after the first replacement, so the ^ anchor matches again.
How can one ensure that only the actual first character is removed (in my case the expected output would be 12345)?
You can see my workaround in my answer, but maybe there is another nice trick to achieve it.

One can match the rest of the line, capture the match into a group and then use this group as replacement. The pattern in the question could be adjusted to ^.(.*) and be replaced by $1.
This will force the cursor to move forward in the string, so the ^ anchor can't match again.
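The behaviour guessed at in the question, and why consuming the rest of the line avoids it, can be modeled with a small Python sketch. This is only a toy model: replace_all_in_place is a hypothetical helper, Python's re module is not the engine Notepad++ uses, and resuming the search at the end of the inserted text is an assumption about what Replace All does.

```python
import re

def replace_all_in_place(pattern, repl, text):
    """Toy 'Replace All' that keeps searching in the already-modified text."""
    regex = re.compile(pattern, re.MULTILINE)
    pos = 0
    while True:
        m = regex.search(text, pos)
        if m is None:
            return text
        new = m.expand(repl)
        text = text[:m.start()] + new + text[m.end():]
        # Resume right after the inserted text. With an empty replacement
        # this is still the start of the line, so ^ can match again.
        pos = m.start() + len(new)

print(repr(replace_all_in_place(r"^.", "", "012345")))         # whole line is eaten
print(repr(replace_all_in_place(r"^.(.*)", r"\1", "012345")))  # only the first char goes
```

With ^.(.*) the search resumes at the end of the line, where ^ no longer matches, so the loop stops after one replacement.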

Another workaround could be finding:
^.(.)?
and replacing it with:
\1
I'm sure this is the subject of a bug report, but I couldn't find one as of now. In Notepad++:
Anchors are buggy.
With Replace All, replacements are not supposed to be subject to re-matching. But they are when the replacement string is invisible or zero-length.
Watch out for these cases.

Related

Add constants to start and end of "file" after multiple replacements

I have already found how to do multiple replacements, by replacing
(from1)|(from2).....
with
(?1to1)(?2to2)
For example, if I have:
hello all! I think saying hello to all is a nice way to introduce oneself.
and I replace
(hello)|(all)
with
(?1greetings)(?2everyone)
I get
greetings everyone! I think saying greetings to everyone is a nice way to introduce oneself.
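For readers who want to reproduce this outside Notepad++, here is a minimal Python sketch of the same idea. Python's re has no (?1...) conditional replacement syntax, so the sketch uses a replacement function instead; REPLACEMENTS and multi_replace are hypothetical names introduced for illustration.

```python
import re

# Hypothetical lookup table: capture-group number -> replacement text.
REPLACEMENTS = {1: "greetings", 2: "everyone"}

def multi_replace(text):
    # m.lastindex is the number of the group that matched; it plays the
    # role of the (?1...)(?2...) conditionals in the replacement string.
    return re.sub(r"(hello)|(all)", lambda m: REPLACEMENTS[m.lastindex], text)

source = "hello all! I think saying hello to all is a nice way to introduce oneself."
print(multi_replace(source))
```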
Now, I want to add a string at the very beginning and end of file - not each line. So, in that case, my desired result is:
StartOfAllgreetings everyone! I think saying greetings to everyone is a nice way to introduce oneself.EndOfAll
Can you help me with this? Things that I have tried unsuccessfully include using $, \z, \Z to identify the end of line, and using branch reset groups like this: (?|(hello)|(all))*
Use
Find What: (^)(?<!(?s:.))|(hello)|(all)|($)(?!(?s:.))
Or with . matches newline ON: (^)(?<!.)|(hello)|(all)|($)(?!.)
Replace with: (?1StartOfAll)(?2greetings)(?3everyone)(?4EndOfAll)
NOTE: In order to also handle the end of file match when another alternative also matches at the end of the file, you need to add optional groups and handle them in the replacement pattern, too:
Find What: (?s)(^)(?<!.)|(hello)(?:($)(?!.))?|(all)(?:($)(?!.))?|($)(?!.)
Replace with: (?1StartOfAll)(?2greetings)(?3EndOfAll)(?4everyone)(?5EndOfAll)(?6EndOfAll)
Now, the (?:($)(?!.))? optional non-capturing groups ensure an additional capture for end of file positions, and that is why there are additional (?nEndOfAll) in the replacement pattern.
Details
The (^)(?<!(?s:.))|(hello)|(all)|($)(?!(?s:.)) pattern has four alternatives; the ones you are interested in are
(^)(?<!(?s:.)) - the first alternative: the start of file is matched (and captured into Group 1) with ^ that is not preceded by any char (ensured with the negative lookbehind (?<!(?s:.)); the inline modifier group is added to make sure the regex works regardless of the Notepad++ regex settings)
($)(?!(?s:.)) - matches (and captures into Group 4) an end of line that is not followed by any char, i.e. the end of file (see the (?!(?s:.)) negative lookahead).
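The lookaround-based file anchors can be tried in Python as a rough sketch. It assumes Python 3.7 or later (a non-empty match may start right after an empty one, and (?s:.) is supported); the replacement function stands in for the conditional replacement string, and wrap_and_replace and the sample text are made up for illustration.

```python
import re

# Same alternation as in the answer, compiled with MULTILINE so that
# ^ and $ behave per line and the lookarounds do the file-level work.
PATTERN = re.compile(r"(^)(?<!(?s:.))|(hello)|(all)|($)(?!(?s:.))", re.MULTILINE)
REPLACEMENTS = {1: "StartOfAll", 2: "greetings", 3: "everyone", 4: "EndOfAll"}

def wrap_and_replace(text):
    # m.lastindex identifies which of the four alternatives matched.
    return PATTERN.sub(lambda m: REPLACEMENTS[m.lastindex], text)

sample = "hello all!\nI say hello to all"
print(wrap_and_replace(sample))
```

In this Python model the end-of-file alternative still fires after a final all, so the optional groups from the NOTE appear to be needed only for the Notepad++ engine.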

Match multiline PCRE until an exception

Is it possible to use a regex to generate matches until a pattern is broken?
https://regex101.com/r/bRQkWM/1
(?m)(?=.*?\*)(\d+)|\*\w*.*$
In this instance, capture the digits at the start of the line, plus the rest of the line provided the line begins with a *.
If the line does not begin with a *, do not match digits or rest of line.
Thank you in advance!
The solution should be (link):
(?m)\G(\d+)\s+\*(\w*.*)(?:[\n\r]+|$)
However... The example you provided has a broken pattern right in its first line, as there is no * in that line. That leads me to the conclusion that you wish to ignore all lines before the first match. If that is your desired specification, then the solution should be (link):
(?m)\A(?:\d+\s+[^*]\w*.*$[\n\r]*)*|\G(\d+)\s+\*(\w*.*)(?:[\n\r]+|$)
This extended regex pattern will work even if there is no broken pattern before the first match.
Please keep in mind that the first match of this solution has to be ignored, as it contains those ignored lines before the first match, or it is empty if there are no lines needed to be ignored.
The key to the above solution(s) is the use of \G, the anchor that matches at the position where the previous match ended.
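Python's re module has no \G, but the "continue exactly where the previous match ended" idea can be approximated by calling Pattern.match with an explicit start position. The sketch below is an emulation, not the PCRE solution itself; contiguous_matches and the sample text are hypothetical.

```python
import re

# The first solution without \G; anchoring is done by match(text, pos).
LINE = re.compile(r"(\d+)\s+\*(\w*.*)(?:[\n\r]+|$)", re.MULTILINE)

def contiguous_matches(text):
    results = []
    pos = 0
    while True:
        m = LINE.match(text, pos)  # must match exactly at pos, like \G
        if m is None or m.end() == pos:
            break                  # pattern broken: stop for good
        results.append(m.groups())
        pos = m.end()
    return results

# Made-up input: the third line has no *, so matching stops there.
sample = "12 *foo bar\n34 *baz\n56 qux\n78 *late\n"
print(contiguous_matches(sample))
```

The last line is never reached even though it would match on its own, which is the "until the pattern is broken" behaviour.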

Why does this regex fail to find matching pattern in string?

I have a file which will contain contents similar to the following example:
?[A7]DA<DA-SG
'[G7]G%SD\$DF
#[27]F:./4FFF
?[P9]W3_2SS_F
'[90]GA\\WTER
Each line ends in \r\n.
From this particular file, I need to replace the F:./4FFF part of the line #[27]F:./4FFF.
So far to start, I have this pattern in order to try and capture the part I need to replace:
\#\[27\]([\w\W]*)\r\n
The problem is that between the closing ] and the \r\n, could be any alphanumeric character or symbol.
I think the problem lies in the capturing group??? What is the correct pattern for this; I will be doing this in VBA.
You might be trying to design an expression, that would somewhat look like:
(?<=^#\[\d{2}\])\S*
Or maybe just:
^(#\[\d+\])\S*
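Use the multiline option with these (in VBA, set the RegExp object's MultiLine property to True). Note that if you use the VBScript RegExp object from VBA, lookbehind is not supported, so only the second form applies there.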
Two ways to do it
^#\[27\](.*)
or
^#\[27\]([^\r\n]*)
The thing is that \r\n is not needed to stop the match on the line.
It goes to the end without matching them.
This is advantageous if the line is the last one in the file.
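A quick way to see the difference is to run both patterns over the sample data. The sketch below uses Python rather than VBA, so engine details may differ; there the question's pattern does match, but its greedy [\w\W]* runs on to the last \r\n in the file. REPLACED is a placeholder value.

```python
import re

lines = [
    r"?[A7]DA<DA-SG",
    r"'[G7]G%SD\$DF",
    r"#[27]F:./4FFF",
    r"?[P9]W3_2SS_F",
    r"'[90]GA\\WTER",
]
text = "\r\n".join(lines) + "\r\n"

# Pattern from the question: the group swallows the following lines too.
greedy = re.search(r"\#\[27\]([\w\W]*)\r\n", text).group(1)

# Line-bound pattern from the answer: stops before the line break.
per_line = re.search(r"^#\[27\]([^\r\n]*)", text, re.MULTILINE).group(1)

# Replace only the part after the #[27] prefix.
updated = re.sub(r"^(#\[27\])[^\r\n]*", r"\g<1>REPLACED", text, flags=re.MULTILINE)

print(repr(greedy))
print(repr(per_line))
print(updated.split("\r\n")[2])
```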

Notepad++ regex group capture

I have such txt file:
ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua
Trying to delete all subdomains with such regex:
Find: .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1
Receive:
prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua
Why does the last line become com.ua instead of jwbefw.com.ua?
This works without look around:
Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2
It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.
There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.
Incorrect
Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:
Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2
It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.
The .+ part is matching as much as possible. Try using .+? instead, and it will capture the least possible, allowing the com.ua option to match.
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
This answer still uses the specific domain names that the original question was looking at. As some TLD (top level domains) have a period in them, and you could theoretically have a list including multiple subdomains, whitelisting the TLD in the regex is a good idea if it works with your data set. Both current answers (from 2013) will not handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.
Here is a quick explanation of why psnig's original regex isn't working as intended:
The + is greedy.
.+ will zip all the way to the right at the end of the line capturing everything,
then work its way backwards (to the left) looking for a match from here:
(ru|ua|com\.ua|com|net|info)
With srfsf.jwbefw.com.ua the regex engine will first fail to match a,
then it will move the token one place to the left to look at "ua"
At that point, ua from the regex (the second option) is a match.
The engine will not keep looking to find "com.ua" because ".ua" met that requirement.
Niet the Dark Absol's answer tells the regex to be "lazy"
.+? will match any character (at least one) and then try to find the next part of the regex. If that fails, it will advance the token, with .+? matching one more character, and then evaluate the rest of the regex again.
The .+? will eventually consume: srfsf.jwbefw before matching the period, and then matching com.ua.
But the implementation of ? also creates issues.
Adding the question mark makes that first .+ lazy, but then causes group 1 to match bb.prontube.ru instead of prontube.ru
This is because the first period (the one after xx) will match, then inside group 1 (.*?) will match bb.prontube before \.(ru|ua|com\.ua|com|net|info))$ matches .ru
To avoid this, change the inner group from (.*?) to ([\w-]*?) so it won't capture a period, only letters, numbers, underscores, or a dash.
The resulting regex:
.+?\.(([\w-]*?)\.(ru|ua|com\.ua|com|net|info))$
Note that you don't need to capture any groups other than the first. Adding ?: makes the TLD options non-capturing.
last change:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
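The three stages discussed above can be compared side by side in Python, whose backtracking behaves the same way for these patterns. This is a sketch for checking the reasoning; strip_subdomains and the variable names are made up, and the TLD group is made non-capturing in all three variants, which does not change group 1.

```python
import re

hosts = ["ххх.prontube.ru", "salo.ru", "bbb.antichat.ru", "yyy.ru",
         "xx.bb.prontube.ru", "zzz.com", "srfsf.jwbefw.com.ua"]

TLDS = r"(?:ru|ua|com\.ua|com|net|info)"
greedy = re.compile(r".+\.((.*?)\." + TLDS + r")$")       # original question
lazy_any = re.compile(r".+?\.((.*?)\." + TLDS + r")$")    # lazy, but .*? crosses dots
lazy_label = re.compile(r".+?\.([\w-]*?\." + TLDS + r")$")  # final version

def strip_subdomains(pattern, names):
    # Names that do not match (no subdomain) are returned unchanged.
    return [pattern.sub(r"\1", name) for name in names]

for label, pattern in [("greedy", greedy), ("lazy_any", lazy_any),
                       ("lazy_label", lazy_label)]:
    print(label, strip_subdomains(pattern, hosts))
```

Only the last variant handles both xx.bb.prontube.ru and srfsf.jwbefw.com.ua as intended.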
Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1
Look at the picture above to see what each part of the regex captures.
It is color-coded so that you will not need a regex explanation anymore.

Regex negation in vim

In vim I would like to use a regex to highlight each line that ends with a letter that is preceded by neither // nor :. I tried the following
syn match systemverilogNoSemi "\(.*\(//\|:\).*\)\@!\&.*[a-zA-Z0-9_]$" oneline
This worked very well on comments, but did not work on lines containing a colon.
Any idea why?
Because with this regex Vim can choose any point to start the match. Obviously it chooses a point where the first concat matches (i.e. one after which there is no // or :). These things are normally done using either
\v^%(%(\/\/|\:)@!.)*\w$
(removed the first concat and the branch itself, changed .* to %(%(\/\/|\:)@!.)*; replaced the collection with the equivalent \w; added an anchor for the start of line) if you need to match the whole line, or a negative look-behind if you need to match only the last character. You can also just add an anchor to the first concat of your variant (you should remove the trailing .* from the first concat as it is useless, and the branch symbol for the same reason).
Note: I have no idea why your regex worked for comments. It does not work with comments the way you need it in all cases I checked.
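The anchored "check every character" idea from this answer can be tried outside Vim. The sketch below uses Python's re syntax, where (?!...) plays the role of Vim's negative lookahead; it is an analogue under that assumption, not a Vim pattern, and the sample lines are made up.

```python
import re

# From the start of the line, every character must not begin // or :,
# and the line must end with a word character.
NO_SEMI = re.compile(r"^(?:(?!//|:).)*\w$")

samples = [
    "assign a = b",           # ends with a letter, no // or :  -> flagged
    "assign a = b; // note",  # contains //                     -> skipped
    "case 1: foo",            # contains :                      -> skipped
    "wire x;",                # ends with ;                     -> skipped
    "foo_1",                  # ends with a word character      -> flagged
]
flagged = [s for s in samples if NO_SEMI.match(s)]
print(flagged)
```

Because the pattern is anchored at the start of the line, the engine cannot begin the match after the // or : as it could with the unanchored original.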
does this work for you?
^\(\(//\|:\)\@<!.\)*[a-zA-Z0-9_]$