Why does this particular Vim RegEx string work? - regex

I had spent a while trying to narrow down a way of retrieving only web links from a few thousand lines that ended with either jpg or png.
If I use
%s/\(http.*\(jpg\|png\)\)\=\(.*\|\_s\)/\1/g|%s/\n\=
I can grab links just fine. The several thousand lines are removed and replaced by only the matching links. But if I remove the first \=, like here
%s/\(http.*\(jpg\|png\)\)\(.*\|\_s\)/\1/g|%s/\n\=
nothing in the file is changed or removed, and all the text is highlighted as a match.
If I remove it from the end of the pattern string, it concatenates every match onto a single line. I understand the basic reason for why this happens (being used by itself). That said, I am lost as to why it does not happen the same way when used in this specific case. (Meaning, the links do not get piled onto one line.)
My questions are:
Why do the links remain unchanged in the first example rather than replace the entire file or be removed entirely?
Why does specifying \n as an optional element not remove the nulls when the meaning of \= is "match 0 OR 1"?

Starting from the end of your regexp, with
%s/\n\=
You're substituting 0 or 1 \n with nothing on every line. Since you're not using the g flag, on any line that begins with anything but a \n the 0 part matches, and nothing is substituted with nothing: i.e. the line remains the same. (Led Zeppelin quote)
It's equivalent to:
:%s/^\n
If you remove the \=, the first \n actually found in every line will be removed, that's why empty lines and the newlines at the end of your non empty lines get removed.
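The "0 part" behaviour can be reproduced outside Vim. The following is only a rough sketch using Python's re module as a stand-in (an assumption: its syntax and engine differ from Vim's), with ^\n? playing the role of \n\= tried at the start of each line. The sample text is made up.

```python
import re

# Hypothetical sample: two non-empty lines separated by one empty line.
text = "first\n\nsecond\n"

# ^\n? is tried at the start of every line (re.MULTILINE).
# On a non-empty line it matches the empty string, so nothing changes.
# On an empty line it matches the newline itself, which gets removed.
result = re.sub(r"^\n?", "", text, flags=re.MULTILINE)

print(repr(result))
```

The non-empty lines come out untouched because the pattern succeeds there without consuming anything.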
Now, here:
%s/\(http.*\(jpg\|png\)\)\=\(.*\|\_s\)/\1/g
The \= makes it so that any string with 0 or 1 \(http.*\(jpg\|png\)\) patterns followed by anything (since you have \(.*\|\_s\)) will be replaced by the first saved pattern.
Basically, you're matching your whole file and preventing only this pattern: \(http.*\(jpg\|png\)\) from being removed.
When you remove \=, the 0 part of the match drops, and only in the lines that actually have the \(http.*\(jpg\|png\)\) pattern there will be a substitution of the matched pattern with itself from http up to jpg/png with anything after that being removed.
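To see the difference between the optional and the required group, here is a small sketch in Python's re (a stand-in for Vim's engine, so the syntax differs: ? instead of \=, and unescaped parentheses). The sample lines are hypothetical, and the pattern is simplified to work on one line at a time.

```python
import re

lines = [
    "http://example.com/a.jpg trailing text",  # hypothetical sample data
    "a line without any link",
]

# Optional group: the pattern can match a line even when no link is present.
optional = re.compile(r"(http.*(?:jpg|png))?.*")
# Required group: the pattern only matches lines that contain a link.
required = re.compile(r"(http.*(?:jpg|png)).*")

# count=1 mimics a single substitution per line.
with_optional = [optional.sub(r"\1", line, count=1) for line in lines]
with_required = [required.sub(r"\1", line, count=1) for line in lines]

print(with_optional)
print(with_required)
```

With the optional group, the link-free line is matched as a whole and replaced by an empty capture; with the required group, that line is not matched at all and stays as it was.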
On a side note, if you save a pattern but don't use it in the substitution string, you're losing that pattern anyway.
If you actually only want to keep the http..jpg/png lines and remove the others, you can use the g! or v command:
:v/http.*\(jpg\|png\)/d
deletes all the lines that don't have the matched pattern.
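For comparison, the same "keep only the lines that contain a link" filter can be sketched in Python. This is an illustration, not a Vim replacement; the function name keep_link_lines and the sample lines are made up. Note the group around jpg|png: without it the alternation would split the whole pattern into "http.*jpg" or "png".

```python
import re

# Group the alternation so it applies only to the file extension.
link = re.compile(r"http.*(?:jpg|png)")

def keep_link_lines(lines):
    """Return only the lines that contain an http...jpg/png link."""
    return [line for line in lines if link.search(line)]

sample = [
    "see http://example.com/a.jpg",   # hypothetical sample data
    "a png without any link",
    "http://example.com/b.png",
    "nothing here",
]

kept = keep_link_lines(sample)
print(kept)
```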

Related

Bug in Notepad++ / BOOST or bug in my regular expression?

I have a file which is structured like this:
Line
foo Änderbar: PM baz
Line
Line
foo Änderbar: OM baz
Line
Line
foo Änderbar: ++ baz
Line
Line
foo Änderbar: -- baz
Line
So the file consists of "blocks" which are separated by a newline (I have converted the file to Unix line endings). Each block can have an arbitrary number of lines. Each line of a block contains at least one character which is not a newline, and is finished by a newline character. The lines which separate the blocks consist of exactly one newline character.
In each block, there is exactly one line in the following format:
at least one character which is not newline, followed by
the literal string 'Änderbar: ', followed by
exactly one of the literal strings '++', '--', 'OM', 'PM', followed by
at least one character which is not newline, followed by
the line-terminating newline character
There is always at least one other non-empty line in the same block above this special line and one other non-empty line below this special line.
I need an effective method to find (and thereby select) all blocks where the literal after Änderbar: is -- (find / select one block after another, each one after hitting Find Next again, i.e. not selecting all of those blocks at the same time).
Normally, I have fun solving such problems with Notepad++. However, in that case, it seems that I either get more and more stupid as I get older, or that there is a bug in Notepad++'s regex handling engine.
Notepad++ uses BOOST (and supports PCRE expressions via BOOST). Since this is in wide use, I consider that problem important enough to post it here, just in case that BOOST really is the reason for the misbehavior.
Having said this: I loaded that file into Notepad++, fired up the Search and Replace dialog, ticked . matches newline, ticked Regular Expression and entered the following regex in the Find What: textbox:
\n([^\n]+\n)+[^\n]+(Änderbar\:\ --[^\n]+\n)([^\n]+\n)+
I was quite surprised that this made Notepad++ behave weirdly: When the cursor was placed in the empty line immediately before a block with Änderbar: --, hitting Find Next found / selected that block as expected. But when the cursor was at another place, hitting Find Next made Notepad++ find / select the whole rest of the file, i.e. all blocks below the cursor position.
I then have tested if it would find the blocks having ++ after Änderbar:, i.e. I changed my regex to
\n([^\n]+\n)+[^\n]+(Änderbar\:\ \+\+[^\n]+\n)([^\n]+\n)+
Guess what: This worked reliably in every situation. The same is true for the remaining two:
\n([^\n]+\n)+[^\n]+(Änderbar\:\ PM[^\n]+\n)([^\n]+\n)+
\n([^\n]+\n)+[^\n]+(Änderbar\:\ OM[^\n]+\n)([^\n]+\n)+
So Notepad++ / PCRE seems to have a problem with the correct interpretation of - under certain circumstances, or I have a subtle bug in my regex which only triggers when I am searching for -- (instead of ++, OM or PM) at the respective place.
Please note that I have already tried omitting the \ in front of the space character (which actually could only make the situation worse, but I've tried just in case) and that I have also tried to use \-\- instead of -- (although the latter should be fine). That did not alter the (mis-)behavior in any way.
So what is the problem here? Is there a bug in my regex, or is there a bug in Notepad++?
UPDATE
I have stripped down the actual file in question and have uploaded it to https://pastebin.com/w62E57U5. To reproduce the problem, please do the following:
Download the file from the link above and save it somewhere on your HDD (do not copy the text directly into Notepad++).
Load the file into Notepad++. The cursor now is in the topmost line, and nothing is selected.
This is essential: Click Edit -> EOL Conversion -> Unix (LF).
Verify that the cursor is still in the topmost line (which is empty) and that nothing is selected.
Open the Find dialog and choose the settings and enter the search string as described above.
Click "Find Next".
Note that now the complete text is found / selected.
Keeping the Find window open, delete the third line of the file (it reads "Funktionspaket(e): ML"). Do not just empty that line, but really delete it so that no empty line remains between the line before and the line after.
Again, place the cursor in the topmost line (which is still empty) and make sure nothing is selected.
Click "Find Next".
Note that the regular expression now works as expected.
Obviously, somebody is trying to make a fool of me, right?
I think the key is: you need to begin your regex with ^ (beginning of line).
Your original regex becomes:
^\n([^\n]+\n)+[^\n]+(Änderbar\:\ --[^\n]+\n)([^\n]+\n)+
But you can simplify it with:
^\R(?:.+\R)+.+Änderbar: --.+\R(?:.+(?:\R|\z))+
Note: untick . matches newline for this version, so that .+ cannot run past the end of a line.
Where:
\R matches any kind of linebreak, so there is no need to change the EOL.
\z matches the end of the file; without it, you can't match the last line of the file if it has no trailing linebreak.
(?:...) is a non-capturing group, which is more efficient (if you don't need to capture, of course).
Both work fine with your 2 sample files.
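The anchored idea can be tried outside Notepad++ as well. The sketch below uses Python's re, which has neither \R nor \z, so it assumes LF-only input and substitutes \n and \Z; the sample text is invented. finditer stands in for pressing Find Next repeatedly.

```python
import re

# Hypothetical sample: three blocks separated by empty lines, LF endings,
# no linebreak after the last line.
text = (
    "\nA1\nfoo Änderbar: ++ baz\nA3\n"
    "\nB1\nfoo Änderbar: -- baz\nB3\n"
    "\nC1\nbar Änderbar: -- qux\nC3"
)

# ^\n anchors the match on an empty separator line; "." does not match a
# newline here, so each ".+" stays within one line.
block = re.compile(
    r"^\n(?:.+\n)+.+Änderbar: --.+\n(?:.+(?:\n|\Z))+",
    re.MULTILINE,
)

matches = [m.group(0) for m in block.finditer(text)]
for found in matches:
    print(repr(found))
```

Each match starts on a separator line and stops before the next one, so the following search can begin there.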
It's not a bug. You're just forgetting something very important - with Windows line endings, your lines have a \r before the \n, so the \n([^\n]+\n)+ part of your RegEx will also match your blank lines which is why clicking "Find Next" matches everything from the cursor position instead of from the start of the block.
Go to Edit > EOL Conversion > Unix (LF) and you'll see that it works now. If you want to support Windows and Unix line endings you'll have to change every [^\n] to [^\r\n] and every \n to \r?\n.
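The effect of the stray \r can be shown with a short sketch in Python's re (an assumption: it is a different engine from the one in Notepad++, but [^\n] behaves the same way). The two-block sample text is made up.

```python
import re

# Hypothetical sample: a "++" block followed by a "--" block.
lf_text = (
    "\nA1\nfoo Änderbar: ++ baz\nA3\n"
    "\nB1\nfoo Änderbar: -- baz\nB3\n"
)
crlf_text = lf_text.replace("\n", "\r\n")

# Pattern from the question: only aware of \n.
lf_only = re.compile(r"\n([^\n]+\n)+[^\n]+(Änderbar: --[^\n]+\n)([^\n]+\n)+")
# Variant with [^\r\n] and \r?\n, as suggested above.
either = re.compile(
    r"\r?\n([^\r\n]+\r?\n)+[^\r\n]+(Änderbar: --[^\r\n]+\r?\n)([^\r\n]+\r?\n)+"
)

on_lf = lf_only.search(lf_text).group(0)
on_crlf = lf_only.search(crlf_text).group(0)      # blank lines count as content
fixed_on_crlf = either.search(crlf_text).group(0)

print(repr(on_lf))
print(repr(on_crlf))
print(repr(fixed_on_crlf))
```

On the CRLF text, [^\n]+ accepts the lone \r of a blank line, so the first pattern runs through the separator and swallows the preceding block too.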

How do I regex search in x and y for a, and only include the replacement of y if a was found in x?

I need to search through a larger text file.
This is an example of what I'm searching through.
https://pastebin.com/JFVy2TEt
recipes.addShaped("basemetals:adamantine_arrow", <basemetals:adamantine_arrow> * 4, [[<ore:nuggetAdamantine>], [<basemetals:adamantine_rod>], [<minecraft:feather>]]);
I need to look for lines that match a specific part in the first argument.
For example the "_arrow" part in the above line.
And erase everything that doesn't match on the "_arrow" in the first argument.
And the arguments differ across all of them.
And also with different names in the place where "basemetals:adamantine" is in the above line.
And since the further arguments are all different I can't wrap my head around on how to include the end only when the first thing matches.
Edit: The end goal is to make it easier to sort my 3k+ line text file.
basic, blacksmith, carpenter, chef, chemist, engineer, farmer, jeweler, mage, mason, scribe, tailor
I think what you're trying to do is filter your text file by removing lines that don't fit a set criteria. I've chosen the Atom text editor for this solution (because I'm running Windows OS and can't install gedit, and I want to ensure you have a working example).
To remove only lines that don't have a first argument ending in _arrow, one could do (?!recipes\.addShaped\("[^"]+_arrow")recipes.+\r?\n? and replace with nothing.
As a note: this task is made more difficult by Atom's limited regex support. In a better-supported environment, my answer would probably be ^recipes\.addShaped\("[^"]+(?<!_arrow)".+\r?\n? (with multiline mode).
Also, please read "What should I do when someone answers my question?".
Regex explained:
(?! ) is a negative lookahead, which peeks at the succeeding text to ensure it doesn't contain "_arrow" at end of the first argument.
\. is an escaped literal period
[^"] is a character class that signifies a character that is not a ".
+ is a quantifier which tells the regex to match the preceding character or subexpression as many times as possible, with a minimum of one time.
. is a wildcard, representing any character
\r?\n? is used to match any kind of newline, with the ? quantifier making each character optional.
Everything else is literal characters; it represents exactly what it matches.
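As a quick check of the lookahead approach outside Atom, here is a sketch in Python's re. The first sample line is the one from the question; the other two are hypothetical, and the ^ anchor with multiline mode is an addition for this sketch.

```python
import re

# Match whole recipe lines whose first argument does NOT end in _arrow.
not_arrow = re.compile(
    r'^(?!recipes\.addShaped\("[^"]+_arrow")recipes.+\r?\n?',
    re.MULTILINE,
)

arrow_line = (
    'recipes.addShaped("basemetals:adamantine_arrow", '
    '<basemetals:adamantine_arrow> * 4, [[<ore:nuggetAdamantine>], '
    '[<basemetals:adamantine_rod>], [<minecraft:feather>]]);'
)
# Hypothetical lines, invented for illustration.
rod_line = 'recipes.addShaped("basemetals:adamantine_rod", <basemetals:adamantine_rod> * 4, [[<ore:ingotAdamantine>]]);'
other_arrow = 'recipes.addShaped("basemetals:copper_arrow", <basemetals:copper_arrow> * 4, [[<ore:nuggetCopper>]]);'

text = "\n".join([arrow_line, rod_line, other_arrow]) + "\n"

# Replace every non-matching recipe line (and its newline) with nothing.
filtered = not_arrow.sub("", text)
print(filtered)
```

Because [^"]+ cannot cross a quote, only the first argument is inspected; an _arrow further along the line does not count.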

How can I collapse multiple whitespace lines with vim?

The question and answers here cover in detail how the following vim command collapses a series of empty lines into a single line:
:g/^$/,/./-j
However, I want to do the same but also treat lines with only whitespace in them as blank. The following command is what I tried but it doesn't work:
:g/^\s*$/,/./-j
As far as I can tell, that should find the lines that are empty and have only whitespace on them, but not all lines are being collapsed.
You're halfway there.
Remember that the initial command consisted of a search part and an action part. The search part :g/^$/ found all empty lines and the action part ,/./-j was executed for each (well, each that hadn't already been deleted by a previous j).
The modification you made to the search part of the string is correct in that it will now find lines that are either empty or contain only whitespace.
However, it's the action that you're executing after that that's causing you grief. The original action to be executed on the found line was ,/./-j which basically means execute a join j over the range from this line to the one before the next 'real' character. More detail on how this works can be found in the question you linked to.
The first 'real' character that it finds in your case actually includes whitespace so, while the search bit will find whitespace lines and act on them, the range of the join in the action will not be what you want.
What you need to specify for the end of the range in the action is the line previous to the next one that has something other than whitespace (rather than just a line with any 'real' character). A line with a non-whitespace character is simply one that matches the regex \S (the backslash with uppercase S denotes a non-whitespace character).
So, in the end, what you're looking for is:
:g/^\s*$/,/\S/-j
Having said that, keep in mind that the line that remains behind is (I think) the first from the range. So, it's not necessarily empty, it may contain white-space.
If you wish to ensure all whitespace-only lines are made empty, just execute:
:g/^\s*$/s/.*//
after the collapsing command above. Or, you can combine both into a single command using | as an action separator:
:g/^\s*$/,/\S/-j|s/.*//
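For readers who want the intended result spelled out, this is a rough Python sketch of the same end state: every run of empty or whitespace-only lines becomes a single empty line. It does not reproduce how Vim gets there, and the function name collapse_blank_runs is made up.

```python
def collapse_blank_runs(lines):
    """Collapse each run of empty/whitespace-only lines into one empty line."""
    out = []
    prev_blank = False
    for line in lines:
        blank = line.strip() == ""   # empty or whitespace-only
        if blank and prev_blank:
            continue                 # skip the rest of the run
        out.append("" if blank else line)
        prev_blank = blank
    return out

collapsed = collapse_blank_runs(["a", "  ", "", "\t", "b", "", "c"])
print(collapsed)
```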

RegExp adaption with new line

I have the following RegExp to find the URIs listed below:
"^w{3}\.[\S\-\n|\S]+[^\s.!?,():]+$"
URLs to find:
1. www.example.org
2. www.example-example.org
3. www.example-example.org/product
4. You'll find it at www.example-
example.org/product.
5. www.example.org
You'll find it there.
Numbers 1, 2 and 3 are found, but number 4 delivers "www.example-" as the URI.
Without the period at the end of number 4, it would be matched correctly.
EDIT: After deleting ^ and $, only number 5 is not working.
Can anyone help here?
Your pattern
^w{3}\.[\S\-\n|\S]+[^\s.!?,():]+$
can be simplified to
^w{3}\.[\S\n]+[^\s.!?,():]$
[\S\-\n|\S] this is a character class, no OR possible, no repetition needed, - is included in \S. So [\S\n] is doing the same.
[^\s.!?,():]+ because you match every non whitespace with the expression before this one, here the + is not needed. I assume you just want your pattern not to end with one of the characters from the class.
See your pattern on Regexr (I added \r to your first class, because the line breaks there need it)
This is a very useful tool to test regexes
I think your problem is that you want to allow line breaks in the link. How do you want to handle this? How do you want to distinguish when the line ends with a link if the word in the next line is just a word or part of the link. I think this is not possible!
The problem is the '^\s' in the second squared bracketed part. Depending on your programming language, '\s' might match the new line. So, you are telling it to match anything that is not a whitespace and it finds a whitespace (new line).
However, this should only be one of your issues. Your regex uses the '^' and '$' characters which mean start and end of line respectively. Try this URL example:
hello from www.example.org
Did it match? I think it will not.
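The anchor problem is easy to check. The sketch below uses Python's re (the question does not say which language is in use, so this is an assumption). The unanchored pattern is a hypothetical variation for illustration: it drops ^, $ and the \n from the first class, so it does not handle links broken across lines.

```python
import re

# Simplified pattern from above, anchored to line start and line end.
anchored = re.compile(r"^w{3}\.[\S\n]+[^\s.!?,():]$", re.MULTILINE)
# Hypothetical variation without anchors and without \n in the class.
unanchored = re.compile(r"w{3}\.\S+[^\s.!?,():]")

sentence = "hello from www.example.org"
with_period = "You'll find it at www.example.org."

anchored_hit = anchored.search(sentence)
plain_hit = unanchored.search(sentence).group(0)
period_hit = unanchored.search(with_period).group(0)

print(anchored_hit)
print(plain_hit)
print(period_hit)
```

The anchored pattern finds nothing because the line does not start with www, while the unanchored one finds the link and leaves the trailing period out.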

How to match the last url in a line containing multiple urls, using regular expressions?

I want to write a regex that matches a url that ends with ".mp4" given that there are multiple urls in a line.
For example, for the following line:
"http://www.link.org/1610.jpg","Debt","http://www.archive.org/610_.mp4","66196517"
Using the following pattern matches from the first http until mp4.
(http:\/\/[^"].*?\.mp4)[",].*?
How can I make it match only the last url only?
Note that, the lines may contain any number of urls and anything in between. But only the last url contains .mp4 ending.
Use:
.*"(http:\/\/[^"].*?\.mp4)".*
Wildcards are by default greedy. The first part of this will start by grabbing the entire string and then backtrack until it finds a URL. Probably not the most efficient way to do it but it doesn't really matter since you're only doing this on a line of text (unless, say, the line is tens of millions of characters long).
By the way, the piece you had at the end ([",]) wasn't quite correct. That pattern means match either " OR , when I suspect what you really mean is match that sequence (based on your sample line).
Lastly, you don't need to make the final wildcard non-greedy (.*?). You don't need it at all if you're doing a find rather than trying to match the entire line either.
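Both patterns can be compared on the sample line with a short sketch in Python's re (an assumption, since the question does not name the language; the escaped slashes are not needed there and are dropped).

```python
import re

line = '"http://www.link.org/1610.jpg","Debt","http://www.archive.org/610_.mp4","66196517"'

# Pattern from the question: starts at the first http:// it finds.
original = re.search(r'(http://[^"].*?\.mp4)[",].*?', line)
# Leading greedy .* pushes the start of the capture to the last URL.
greedy_prefix = re.search(r'.*"(http://[^"].*?\.mp4)".*', line)

first_try = original.group(1)
last_url = greedy_prefix.group(1)

print(first_try)
print(last_url)
```

The leading .* first consumes the whole line and then backtracks, so the capture begins at the last "http:// it can find.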
Try with
,\s*"(http://[^"]*?\.mp4)"\s*,\s*.*$
(PCRE, not using / as the delimiter; use e.g. | instead.) It matched http://www.archive.org/610_.mp4. This assumes the " characters directly open and close the link, i.e. " link " is not allowed; otherwise, add \s*? to match those spaces too. Another assumption, which may be wrong: the link is the last link but not the last element. If it can be the last element, mp4)"$ could end the RE instead of the current ending.