RegExp adaption with new line - regex

I have the following RegExp to find the URIs listed below:
"^w{3}\.[\S\-\n|\S]+[^\s.!?,():]+$"
URLs to find:
1. www.example.org
2. www.example-example.org
3. www.example-example.org/product
4. You'll find it at www.example-
   example.org/product.
5. www.example.org
   You'll find it there.
Numbers 1, 2 and 3 are found, but number 4 returns "www.example-" as the URI.
Without the period at the end of number 4, it is matched correctly.
EDIT: After deleting ^ and $, only number 5 does not work.
Can anyone help here?

Your pattern
^w{3}\.[\S\-\n|\S]+[^\s.!?,():]+$
can be simplified to
^w{3}\.[\S\n]+[^\s.!?,():]$
[\S\-\n|\S] is a character class, so there is no alternation inside it: the | is a literal character, listing \S twice has no effect, and - is already covered by \S. [\S\n] therefore matches exactly the same characters.
[^\s.!?,():]+ does not need the +, because the preceding class already matches every non-whitespace character. I assume you only want to make sure the match does not end with one of the characters in that class.
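One way to check that the simplification does not change what is matched is to run both patterns over the sample inputs. The sketch below uses Python's re module, which is an assumption because the question does not name a language; the helper name matched_text is made up for illustration.

```python
import re

# Original and simplified patterns from the answer above.
ORIGINAL = re.compile(r"^w{3}\.[\S\-\n|\S]+[^\s.!?,():]+$")
SIMPLIFIED = re.compile(r"^w{3}\.[\S\n]+[^\s.!?,():]$")

# Inputs 1-3 from the question, plus the bare URL of item 4 split over two lines.
SAMPLES = [
    "www.example.org",
    "www.example-example.org",
    "www.example-example.org/product",
    "www.example-\nexample.org/product",
]

def matched_text(pattern, text):
    """Return the matched substring, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None

for sample in SAMPLES:
    # Print what each pattern matches so the two can be compared side by side.
    print(repr(matched_text(ORIGINAL, sample)), repr(matched_text(SIMPLIFIED, sample)))
```

On these inputs both patterns return the full string.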
See your pattern on Regexr (I added \r to your first class, because the line breaks there need it).
It is a very useful tool for testing regexes.
I think your real problem is that you want to allow line breaks inside the link. How do you want to handle this? When a line ends with a link, how do you decide whether the first word on the next line is just a word or part of the link? I think this is not possible!
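The ambiguity can be made concrete. In this sketch (Python's re module, an assumption, with the anchors removed as in the question's edit) the pattern cannot tell a wrapped URL from a URL that is followed by an ordinary word on the next line:

```python
import re

# Simplified pattern without ^ and $, so it can match inside a sentence.
URL = re.compile(r"w{3}\.[\S\n]+[^\s.!?,():]")

wrapped = "You'll find it at www.example-\nexample.org/product."
followed_by_text = "www.example.org\nYou'll find it there."

# The wrapped URL is found as intended.
print(repr(URL.search(wrapped).group(0)))
# 'www.example-\nexample.org/product'

# Here the first word of the next line is swallowed into the "URL".
print(repr(URL.search(followed_by_text).group(0)))
# "www.example.org\nYou'll"
```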

The problem is the '^\s' in the second square-bracketed part. Depending on your programming language, '\s' may match a newline. So you are telling the regex to match anything that is not whitespace, and then it runs into whitespace (the newline).
However, this should only be one of your issues. Your regex uses the '^' and '$' characters which mean start and end of line respectively. Try this URL example:
hello from www.example.org
Did it match? I think it will not.
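Both points can be checked quickly. The following is a small sketch in Python (an assumption, since the question does not name a language) using the pattern from the question with and without the anchors:

```python
import re

ANCHORED = re.compile(r"^w{3}\.[\S\-\n|\S]+[^\s.!?,():]+$")
UNANCHORED = re.compile(r"w{3}\.[\S\-\n|\S]+[^\s.!?,():]+")

text = "hello from www.example.org"

# With ^ the match must begin at the start of the string, which is "hello".
print(ANCHORED.search(text))             # None
print(UNANCHORED.search(text).group(0))  # www.example.org

# In Python, \s matches "\n", so the negated class rejects a line break.
print(re.fullmatch(r"\s", "\n") is not None)            # True
print(re.fullmatch(r"[^\s.!?,():]", "\n") is not None)  # False
```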

Related

Match multiline PCRE until an exception

Is it possible to use a regex to generate matches until a pattern is broken?
https://regex101.com/r/bRQkWM/1
(?m)(?=.*?\*)(\d+)|\*\w*.*$
In this instance, capture the digits at the start of the line, plus the rest of the line provided the line begins with a *.
If the line does not begin with a *, do not match digits or rest of line.
Thank you in advance!
The solution should be (link):
(?m)\G(\d+)\s+\*(\w*.*)(?:[\n\r]+|$)
However, the example you provided already breaks the pattern in its first line, as there is no * in that line. That leads me to conclude that you want to ignore all lines before the first match. If that is the desired specification, then the solution should be (link):
(?m)\A(?:\d+\s+[^*]\w*.*$[\n\r]*)*|\G(\d+)\s+\*(\w*.*)(?:[\n\r]+|$)
This extended pattern also works when no pattern-breaking lines precede the first match.
Please keep in mind that the first match of this solution has to be ignored: it contains the skipped lines before the first real match, or it is empty if there are no lines to skip.
The key to the above solution(s) is the use of \G, the anchor that matches at the position where the previous match ended.
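For readers without a PCRE engine at hand: Python's built-in re module has no \G, but calling a compiled pattern's match() with an explicit start position gives a similar "continue exactly where the last match ended" behaviour. The sketch below is a variation on the first pattern, not a translation of it; the sample text and the function name contiguous_records are hypothetical, because the question's test data is only available through its link.

```python
import re

# One "number, blanks, *, rest of line" record, including its line ending.
RECORD = re.compile(r"(\d+)[ \t]+\*([^\r\n]*)(?:\r?\n|$)")

def contiguous_records(text, start=0):
    """Collect records from `start` until the first line that breaks the pattern."""
    records = []
    pos = start
    while pos < len(text):
        # match() is anchored at pos, which plays the role of \G here.
        m = RECORD.match(text, pos)
        if m is None:
            break  # pattern broken: stop instead of skipping ahead
        records.append((m.group(1), m.group(2)))
        pos = m.end()
    return records

# Hypothetical sample: the third line has no *, so matching stops there.
sample = "1 *alpha\n2 *beta\n3 gamma\n4 *delta\n"
print(contiguous_records(sample))
# [('1', 'alpha'), ('2', 'beta')]
```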

Why does this regex fail to find matching pattern in string?

I have a file which will contain contents similar to the following example:
?[A7]DA<DA-SG
'[G7]G%SD\$DF
#[27]F:./4FFF
?[P9]W3_2SS_F
'[90]GA\\WTER
Each line ends in \r\n.
From this particular file, I need to replace the F:./4FFF part of the line #[27]F:./4FFF.
So far to start, I have this pattern in order to try and capture the part I need to replace:
\#\[27\]([\w\W]*)\r\n
The problem is that between the closing ] and the \r\n, could be any alphanumeric character or symbol.
I think the problem lies in the capturing group??? What is the correct pattern for this; I will be doing this in VBA.
You might be trying to design an expression that looks somewhat like:
(?<=^#\[\d{2}\])\S*
DEMO 1
Or maybe just:
^(#\[\d+\])\S*
DEMO 2
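Use the multiline option with these so that ^ matches at the start of every line (the name of the flag depends on the regex engine). Note that lookbehind, as used in the first expression, is not supported by the VBScript RegExp engine that VBA typically uses, so the second form is the safer choice there.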
Two ways to do it
^#\[27\](.*)
or
^#\[27\]([^\r\n]*)
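The point is that \r\n is not needed to stop the match at the end of the line: the match runs to the end of the line without consuming the line break.
This is an advantage if the line is the last one in the file and has no trailing line break.
Note that in many engines . matches \r, so the first form may capture a trailing \r; the second form excludes both characters.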
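As an illustration, here is a sketch in Python rather than VBA (an assumption made so the example is easy to run). It contrasts the question's [\w\W]* capture with the [^\r\n]* form and then performs the replacement; the replacement text NEW_VALUE and the helper name replace_payload are hypothetical.

```python
import re

# The sample lines from the question, joined with \r\n line endings.
lines = [
    r"?[A7]DA<DA-SG",
    r"'[G7]G%SD\$DF",
    r"#[27]F:./4FFF",
    r"?[P9]W3_2SS_F",
    r"'[90]GA\\WTER",
]
text = "\r\n".join(lines) + "\r\n"

# [\w\W]* also matches line breaks and is greedy, so it runs to the LAST \r\n.
greedy_capture = re.search(r"\#\[27\]([\w\W]*)\r\n", text).group(1)
print(repr(greedy_capture))

# [^\r\n]* stops at the end of the line without consuming the line break.
LINE_27 = re.compile(r"^(#\[27\])[^\r\n]*", re.MULTILINE)
print(repr(re.search(r"^#\[27\]([^\r\n]*)", text, re.MULTILINE).group(1)))
# 'F:./4FFF'

def replace_payload(source, new_payload):
    # A function replacement keeps backslashes in new_payload literal.
    return LINE_27.sub(lambda m: m.group(1) + new_payload, source)

updated = replace_payload(text, "NEW_VALUE")
print(repr(updated))
```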

Notepad++ Regex - Issue with ^ anchor and repeating patterns

When one tries to remove some characters from the start of a line and the anchored pattern can be found again after the first replace, it will be removed again.
For a very simple example: given the input 012345, the search pattern ^. and an empty replacement, Notepad++ removes the whole line when using Replace All. This is most likely because the cursor is still at the start of the line after the first replacement and therefore matches the ^ anchor again.
How can one ensure that only the actual first character is removed (in my case the expected output would be 12345)?
You can see my workaround in my answer, but maybe there is another nice trick to achieve it.
One can match the rest of the line, capture the match into a group and then use this group as replacement. The pattern in the question could be adjusted to ^.(.*) and be replaced by $1.
This will force the cursor to move forward in the string, so the ^ anchor can't match again.
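The intended result of this rewrite can be sketched in Python. This is only an illustration of the pattern and replacement: Python's re engine is an assumption here and does not reproduce the Notepad++ re-matching behaviour, and Python writes the group reference as \1 instead of $1.

```python
import re

text = "012345\nabcdef"

# ^.(.*) consumes the whole line; the replacement puts back everything
# except the first character.
result = re.sub(r"^.(.*)", r"\1", text, flags=re.MULTILINE)
print(result)
# 12345
# bcdef
```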
Another workaround could be finding:
^.(.)?
and replacing it with:
\1
I'm sure this has been the subject of a bug report, but I couldn't find one as of now. In N++, anchors are buggy:
with Replace All, replacements are not supposed to be subject to re-matching.
But they are when the replacement string is invisible or zero-length.
Take care with them.

How to search and replace from the last match of a until b?

I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It matches from the second line of my example to the end of the quoting environment. I guess a greedy quantifier would match up to the last \end{quoting} in the file. Is it possible to do this with search and replace, or should I write a macro for it?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% before \end{quoting} up to that \end{quoting}. In that case you don't really want "any character" (\_.); you want "any character that does not start \\}%" (that is not a single character, but that is the idea).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\@!\_.\)\{-} instead of \_.\{-}. The matched text then cannot contain a second \\}% sequence, which achieves your aim (as far as I can determine it).
This uses the negative zero-width look-ahead \@! to ensure that each character consumed by \_. is not the start of the text we want to avoid; anything else still matches. See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\@!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
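The same idea (a lazy repetition in which every consumed character is guarded by a negative look-ahead) can be sketched outside Vim. The example below uses Python's re module as an assumption, purely to show the difference between the plain lazy match and the guarded one on the sample from the question; the names NAIVE, TEMPERED and strip_last_linebreak are made up for illustration.

```python
import re

# The sample from the question.
text = "\n".join([
    r"\myverse{some text \\",
    r"some more text \\}%",
    "%",
    r"\myverse{again some text \\",
    r"this is my last line \\}%",
    r"\footnote{possibly some footnotes here}%",
    "%",
    r"\end{quoting}",
]) + "\n"

# Plain lazy match: starts at the FIRST \\}% and runs to \end{quoting}.
NAIVE = re.compile(r"\\\\\}%(.*?)\\end\{quoting\}", re.DOTALL)

# Guarded version: no consumed character may begin another \\}%.
TEMPERED = re.compile(r"\\\\\}%((?:(?!\\\\\}%).)*?)\\end\{quoting\}", re.DOTALL)

def strip_last_linebreak(pattern, source):
    # A function replacement avoids backslash escaping in a template string.
    return pattern.sub(lambda m: "}%" + m.group(1) + r"\end{quoting}", source)

naive_result = strip_last_linebreak(NAIVE, text)
tempered_result = strip_last_linebreak(TEMPERED, text)

print(naive_result)     # the \\ is removed after "some more text"
print(tempered_result)  # the \\ is removed after "this is my last line"
```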
You're on the right track using the non-greedy multi. The Vim help files state that
"\{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next sentence warns of the issue that you have encountered:
BUT: A match that starts earlier is preferred over a shorter match: "a\{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.
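The "starts earlier wins" rule quoted from the help is not specific to Vim. A minimal sketch with Python's re module (an assumption; Python writes the lazy quantifier as *? where Vim uses \{-}):

```python
import re

# The lazy quantifier only shortens the match from a given start position;
# the engine still takes the leftmost position at which a match is possible.
m = re.search(r"a*?b", "xaaab")
print(m.group(0))  # aaab
print(m.start())   # 1
```

A shorter match ("b" alone) exists further right, but the match starting at the first "a" is found first.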

Vim S&R to remove number from end of InstallShield file

I've got a practical application for a vim regex where I'd like to remove numbers from the end of file location links. For example, if the developer is sloppy and just adds files and doesn't reuse file locations, you'll end up with something awful like this:
PATH_TO_MY_FILES&gt
PATH_TO_MY_FILES1&gt
...
PATH_TO_MY_FILES22&gt
PATH_TO_MY_FILES_ELSEWHERE&gt
PATH_TO_MY_FILES_ELSEWHERE1&gt
...
So all I want to do is to S&R and replace PATH_TO_MY_FILES*\d+ with PATH_TO_MY_FILES* using regex. Obviously I am not doing it quite right, so I was hoping someone here could not spoon feed the answer necessarily, but throw a regex buzzword my way to get me on track.
Here's what I have tried:
:%s:\(PATH_TO_MY_FILES\w*\)\(\d+\)&gt:\1&gt:gc
But this doesn't work, i.e. if I just do a Vim search on that pattern, it doesn't find anything. However, if I use this:
:%s:\(PATH_TO_MY_FILES\w*\)\(\d\)&gt:\1&gt:gc
It will match the string, but the grouping is off, as expected. For example, the string PATH_TO_MY_FILES22 will be grouped as (PATH_TO_MY_FILES2)(2), presumably because the \d only matches the 2, and the \w match includes the first 2.
Question 1: Why doesn't \d+ work?
If I go ahead and use the second string (which is wrong), Vim appears to find a match (even though the grouping is wrong), but then does the replacement incorrectly.
For example, given that we know the \d will only match the last number in the string, I would expect PATH_TO_MY_FILES22&gt to get replaced with PATH_TO_MY_FILES2&gt. However, instead it replaces it with this:
PATH_TO_MY_FILES2PATH_TO_MY_FILES22&gtgt
So basically, it looks like it finds PATH_TO_MY_FILES22&gt, but then replaces only the & with group 1, which is PATH_TO_MY_FILES2.
I tried another regex at Regexr.com to see how it would interpret my grouping. It looked correct, although it may just be a hack around my lack of regex understanding:
(PATH_TO_\D*)(\d*)&gt
This correctly broke my target string into the PATH part and the entire number, so I was happy. But then when I used this in Vim, it found the match, but still replaced only the &.
Question 2: Why is Vim only replacing the &?
Answer 1:
You need to escape the + or it will be taken literally. For example \d\+ works correctly.
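Once the + is escaped, the grouping described in the question still applies, because \w also matches digits and a greedy \w* takes every digit except the last one. The sketch below illustrates this with Python's re module (an assumption made so the example can be run; Python writes + and groups without backslashes). Vim's counterpart of the lazy \w*? shown here is \w\{-}.

```python
import re

name = "PATH_TO_MY_FILES22&gt"

# Greedy \w* keeps the first digit; only the last one is left for \d+.
greedy = re.match(r"(PATH_TO_MY_FILES\w*)(\d+)&gt", name)
print(greedy.groups())  # ('PATH_TO_MY_FILES2', '2')

# Lazy \w*? takes as little as possible, so \d+ gets the whole number.
lazy = re.match(r"(PATH_TO_MY_FILES\w*?)(\d+)&gt", name)
print(lazy.groups())    # ('PATH_TO_MY_FILES', '22')

# The lazy form still expands when the suffix is really part of the name.
other = re.match(r"(PATH_TO_MY_FILES\w*?)(\d+)&gt", "PATH_TO_MY_FILES_ELSEWHERE1&gt")
print(other.groups())   # ('PATH_TO_MY_FILES_ELSEWHERE', '1')
```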
Answer 2:
An unescaped & in the replacement portion of a substitution means "the entire matched text". You need to escape it if you want a literal ampersand.
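To see how this produces the output in the question, here is a deliberately simplified model of Vim's replacement expansion, written in Python. It is a sketch, not Vim's implementation: it only handles &, \& and \0 to \9, ignores special sequences such as \r or \n, and the function name expand_vim_replacement is hypothetical. It also assumes the replacement part of the command was \1&gt, which is consistent with the output the question reports.

```python
import re

def expand_vim_replacement(template, match):
    """Simplified model: & is the whole match, \\& a literal &, \\N is group N."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\" and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt in "0123456789":
                out.append(match.group(int(nxt)) or "")  # back-reference
            else:
                out.append(nxt)  # escaped character, e.g. \& -> &
            i += 2
        elif ch == "&":
            out.append(match.group(0))  # unescaped & = entire matched text
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# Python equivalent of the second search pattern from the question.
m = re.match(r"(PATH_TO_MY_FILES\w*)(\d)&gt", "PATH_TO_MY_FILES22&gt")

print(expand_vim_replacement(r"\1&gt", m))
# PATH_TO_MY_FILES2PATH_TO_MY_FILES22&gtgt

print(expand_vim_replacement(r"\1\&gt", m))
# PATH_TO_MY_FILES2&gt
```

The first result is the string from the question: group 1, then the entire match in place of &, then the literal "gt".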