Regex taking whole line instead of exact words

Regex taking whole line instead of exact words - regex

I'm trying to match a list of different items in a text. I created a regex but it matches the whole string instead of each seperate item.
This is my current regex:
\[[a-zA-Z]\](.*)\. {1}
My test text:
[step 1] test blahblah blah [A] test item 1. [B] test item 2.
The current regex matches:
[A] test item 1. [B] test item 2.
In 1 string instead of 2 matches.

I think you want to have non-greedy behaviour:
\[[a-zA-Z]\](.*?)\. {1}
Note the question mark (?). It says that the expression coming right before it should match as little as possible to fullfil the expression. Basically, it makes it stop before the first dot and not the last one.
Proof

Related

Regular expression repetition in Ruby [duplicate]

I'm reading the regular expressions reference and I'm thinking about ? and ?? characters. Could you explain me with some examples their usefulness? I don't understand them enough.
thank you

This is an excellent question, and it took me a while to see the point of the lazy ?? quantifier myself.
? - Optional (greedy) quantifier
The usefulness of ? is easy enough to understand. If you wanted to find both http and https, you could use a pattern like this:
https?
This pattern will match both inputs, because it makes the s optional.
?? - Optional (lazy) quantifier
?? is more subtle. It usually does the same thing ? does. It doesn't change the true/false result when you ask: "Does this input satisfy this regex?" Instead, it's relevant to the question: "Which part of this input matches this regex, and which parts belong in which groups?" If an input could satisfy the pattern in more than one way, the engine will decide how to group it based on ? vs. ?? (or * vs. *?, or + vs. +?).
Say you have a set of inputs that you want to validate and parse. Here's an (admittedly silly) example:
Input:
http123
https456
httpsomething
Expected result:
Pass/Fail Group 1 Group 2
Pass http 123
Pass https 456
Pass http something
You try the first thing that comes to mind, which is this:
^(http)([a-z\d]+)$
Pass/Fail Group 1 Group 2 Grouped correctly?
Pass http 123 Yes
Pass http s456 No
Pass http something Yes
They all pass, but you can't use the second set of results because you only wanted 456 in Group 2.
Fine, let's try again. Let's say Group 2 can be letters or numbers, but not both:
(https?)([a-z]+|\d+)
Pass/Fail Group 1 Group 2 Grouped correctly?
Pass http 123 Yes
Pass https 456 Yes
Pass https omething No
Now the second input is fine, but the third one is grouped wrong because ? is greedy by default (the + is too, but the ? came first). When deciding whether the s is part of https? or [a-z]+|\d+, if the result is a pass either way, the regex engine will always pick the one on the left. So Group 2 loses s because Group 1 sucked it up.
To fix this, you make one tiny change:
(https??)([a-z]+|\d+)$
Pass/Fail Group 1 Group 2 Grouped correctly?
Pass http 123 Yes
Pass https 456 Yes
Pass http something Yes
Essentially, this means: "Match https if you have to, but see if this still passes when Group 1 is just http." The engine realizes that the s could work as part of [a-z]+|\d+, so it prefers to put it into Group 2.

The key difference between ? and ?? concerns their laziness. ?? is lazy, ? is not.
Let's say you want to search for the word "car" in a body of text, but you don't want to be restricted to just the singular "car"; you also want to match against the plural "cars".
Here's an example sentence:
I own three cars.
Now, if I wanted to match the word "car" and I only wanted to get the string "car" in return, I would use the lazy ?? like so:
cars??
This says, "look for the word car or cars; if you find either, return car and nothing more".
Now, if I wanted to match against the same words ("car" or "cars") and I wanted to get the whole match in return, I'd use the non-lazy ? like so:
cars?
This says, "look for the word car or cars, and return either car or cars, whatever you find".
In the world of computer programming, lazy generally means "evaluating only as much as is needed". So the lazy ?? only returns as much as is needed to make a match; since the "s" in "cars" is optional, don't return it. On the flip side, non-lazy (sometimes called greedy) operations evaluate as much as possible, hence the ? returns all of the match, including the optional "s".
Personally, I find myself using ? as a way of making other regular expression operators lazy (like the * and + operators) more often than I use it for simple character optionality, but YMMV.
See it in Code
Here's the above implemented in Clojure as an example:
(re-find #"cars??" "I own three cars.")
;=> "car"
(re-find #"cars?" "I own three cars.")
;=> "cars"
The item re-find is a function that takes its first argument as a regular expression #"cars??" and returns the first match it finds in the second argument "I own three cars."

Some Other Uses of Question marks in regular expressions
Apart from what's explained in other answers, there are still 3 more uses of Question Marks in regular expressions.
Negative Lookahead
Negative lookaheads are used if you want to
match something not followed by something else. The negative
lookahead construct is the pair of parentheses, with the opening
parenthesis followed by a question mark and an exclamation point. x(?!x2)
example
Consider a word There
Now, by default, the RegEx e will find the third letter e in word There.
There
^
However if you don't want the e which is immediately followed by r, then you can use RegEx e(?!r). Now the result would be:
There
^
Positive Lookahead
Positive lookahead works just the same. q(?=u) matches a q that
is immediately followed by a u, without making the u part of the
match. The positive lookahead construct is a pair of parentheses,
with the opening parenthesis followed by a question mark and an
equals sign.
example
Consider a word getting
Now, by default, the RegEx t will find the third letter t in word getting.
getting
^
However if you want the t which is immediately followed by i, then you can use RegEx t(?=i). Now the result would be:
getting
^
Non-Capturing Groups
Whenever you place a Regular Expression in parenthesis(), they
create a numbered capturing group. It stores the part of the string
matched by the part of the regular expression inside the
parentheses.
If you do not need the group to capture its match, you can optimize
this regular expression into
(?:Value)
See also this and this.

? simply makes the previous item (character, character class, group) optional:
colou?r
matches "color" and "colour"
(swimming )?pool
matches "a pool" and "the swimming pool"
?? is the same, but it's also lazy, so the item will be excluded if at all possible. As those docs note, ?? is rare in practice. I have never used it.

Running the test harness from Oracle documentation with the reluctant quantifier of the "once or not at all" match X?? shows that it works as a guaranteed always-empty match.
$ java RegexTestHarness
Enter your regex: x?
Enter input string to search: xx
I found the text "x" starting at index 0 and ending at index 1.
I found the text "x" starting at index 1 and ending at index 2.
I found the text "" starting at index 2 and ending at index 2.
Enter your regex: x??
Enter input string to search: xx
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
It seems identical to the empty matcher.
Enter your regex:
Enter input string to search: xx
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
Enter your regex:
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
Enter your regex: x??
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.

Python Regex - How to extract the third portion?

My input is of this format: (xxx)yyyy(zz)(eee)fff where {x,y,z,e,f} are all numbers. But fff is optional though.
Input: x = (123)4567(89)(660)
Expected output: Only the eeepart i.e. the number inside 3rd "()" i.e. 660 in my example.
I am able to achieve this so far:
re.search("\((\d*)\)", x).group()
Output: (123)
Expected: (660)
I am surely missing something fundamental. Please advise.
Edit 1: Just added fff to the input data format.

You could find all those matches that have round braces (), and print the third match with findall
import re
n = "(123)4567(89)(660)999"
r = re.findall("\(\d*\)", n)
print(r[2])
Output:
(660)

The (eee) part is identical to the (xxx) part in your regex. If you don't provide an anchor, or some sequencing requirement, then an unanchored search will match the first thing it finds, which is (xxx) in your case.
If you know the (eee) always appears at the end of the string, you could append an "at-end" anchor ($) to force the match at the end. Or perhaps you could append a following character, like a space or comma or something.
Otherwise, you might do well to match the other parts of the pattern and not capture them:
pattern = r'[0-9()]{13}\((\d{3})\)'

If you want to get the third group of numbers in brackets, you need to skip the first two groups which you can do with a repeating non-capturing group which looks for a set of digits enclosed in () followed by some number of non ( characters:
x = '(123)4567(89)(660)'
print(re.search("(?:\(\d+\)[^(]*){2}(\(\d+\))", x).group(1))
Output:
(660)
Demo on rextester

regex to match what is not matched

I have a regex to search through just under 2 million product numbers: -([A-Za-z0-9]{1-5})$ to match the MFG code (last few letters after the last dash) for example, G4F,XB-RJG4 SJG2G-TRMH would match -TRMH. This was supposed to match every string on my list, however, I am a couple thousand short. This probably means that some were formatted wrong.
what could I do to match a string that doesn't end in -XXXXX, -XXXX, -XXX, or -XX, or in other words, match what is not matched?

Just two steps:
In the search dialog, tick "Bookmark line"
After the search is done, click "Search -> Bookmark -> Inverse Bookmark"
Alternatively, in step 2: "Search -> Bookmark -> Remove bookmarked lines"; afterwards, only the lines that didn't match the regular expression remain.

Question marks in regular expressions

I'm reading the regular expressions reference and I'm thinking about ? and ?? characters. Could you explain me with some examples their usefulness? I don't understand them enough.
thank you

This is an excellent question, and it took me a while to see the point of the lazy ?? quantifier myself.
? - Optional (greedy) quantifier
The usefulness of ? is easy enough to understand. If you wanted to find both http and https, you could use a pattern like this:
https?
This pattern will match both inputs, because it makes the s optional.
?? - Optional (lazy) quantifier
?? is more subtle. It usually does the same thing ? does. It doesn't change the true/false result when you ask: "Does this input satisfy this regex?" Instead, it's relevant to the question: "Which part of this input matches this regex, and which parts belong in which groups?" If an input could satisfy the pattern in more than one way, the engine will decide how to group it based on ? vs. ?? (or * vs. *?, or + vs. +?).
Say you have a set of inputs that you want to validate and parse. Here's an (admittedly silly) example:
Input:
http123
https456
httpsomething
Expected result:
Pass/Fail Group 1 Group 2
Pass http 123
Pass https 456
Pass http something
You try the first thing that comes to mind, which is this:
^(http)([a-z\d]+)$
Pass/Fail Group 1 Group 2 Grouped correctly?
Pass http 123 Yes
Pass http s456 No
Pass http something Yes
They all pass, but you can't use the second set of results because you only wanted 456 in Group 2.
Fine, let's try again. Let's say Group 2 can be letters or numbers, but not both:
(https?)([a-z]+|\d+)
Pass/Fail Group 1 Group 2 Grouped correctly?
Pass http 123 Yes
Pass https 456 Yes
Pass https omething No
Now the second input is fine, but the third one is grouped wrong because ? is greedy by default (the + is too, but the ? came first). When deciding whether the s is part of https? or [a-z]+|\d+, if the result is a pass either way, the regex engine will always pick the one on the left. So Group 2 loses s because Group 1 sucked it up.
To fix this, you make one tiny change:
(https??)([a-z]+|\d+)$
Pass/Fail Group 1 Group 2 Grouped correctly?
Pass http 123 Yes
Pass https 456 Yes
Pass http something Yes
Essentially, this means: "Match https if you have to, but see if this still passes when Group 1 is just http." The engine realizes that the s could work as part of [a-z]+|\d+, so it prefers to put it into Group 2.

The key difference between ? and ?? concerns their laziness. ?? is lazy, ? is not.
Let's say you want to search for the word "car" in a body of text, but you don't want to be restricted to just the singular "car"; you also want to match against the plural "cars".
Here's an example sentence:
I own three cars.
Now, if I wanted to match the word "car" and I only wanted to get the string "car" in return, I would use the lazy ?? like so:
cars??
This says, "look for the word car or cars; if you find either, return car and nothing more".
Now, if I wanted to match against the same words ("car" or "cars") and I wanted to get the whole match in return, I'd use the non-lazy ? like so:
cars?
This says, "look for the word car or cars, and return either car or cars, whatever you find".
In the world of computer programming, lazy generally means "evaluating only as much as is needed". So the lazy ?? only returns as much as is needed to make a match; since the "s" in "cars" is optional, don't return it. On the flip side, non-lazy (sometimes called greedy) operations evaluate as much as possible, hence the ? returns all of the match, including the optional "s".
Personally, I find myself using ? as a way of making other regular expression operators lazy (like the * and + operators) more often than I use it for simple character optionality, but YMMV.
See it in Code
Here's the above implemented in Clojure as an example:
(re-find #"cars??" "I own three cars.")
;=> "car"
(re-find #"cars?" "I own three cars.")
;=> "cars"
The item re-find is a function that takes its first argument as a regular expression #"cars??" and returns the first match it finds in the second argument "I own three cars."

? simply makes the previous item (character, character class, group) optional:
colou?r
matches "color" and "colour"
(swimming )?pool
matches "a pool" and "the swimming pool"
?? is the same, but it's also lazy, so the item will be excluded if at all possible. As those docs note, ?? is rare in practice. I have never used it.

Running the test harness from Oracle documentation with the reluctant quantifier of the "once or not at all" match X?? shows that it works as a guaranteed always-empty match.
$ java RegexTestHarness
Enter your regex: x?
Enter input string to search: xx
I found the text "x" starting at index 0 and ending at index 1.
I found the text "x" starting at index 1 and ending at index 2.
I found the text "" starting at index 2 and ending at index 2.
Enter your regex: x??
Enter input string to search: xx
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
It seems identical to the empty matcher.
Enter your regex:
Enter input string to search: xx
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
Enter your regex:
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
Enter your regex: x??
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.

Substitute the n-th occurrence of a word in vim

I saw other questions dealing with the finding the n-th occurrence of a word/pattern, but I couldn't find how you would actually substitute the n-th occurrence of a pattern in vim. There's the obvious way of hard coding all the occurrences like
:s/.*\(word\).*\(word\).*\(word\).*/.*\1.*\2.*newWord.*/g
Is there a better way of doing this?

For information,
s/\%(\(pattern\).\{-}\)\{41}\zs\1/2/
also works to replace 42th occurrence. However, I prefer the solution given by John Kugelman which is more simple -- even if it will not limit itself to the current line.

You can do this a little more simply by using multiple searches. The empty pattern in the :s/pattern/repl/ command means replace the most recent search result.
:/word//word//word/ s//newWord/
or
:/word//word/ s/word/newWord/
You could then repeat this multiple times by doing #:, or even 10#: to repeat the command 10 more times.
Alternatively, if I were doing this interactively I would do something like:
3/word
:s//newWord/r
That would find the third occurrence of word starting at the cursor and then perform a substitution.

Replace each Nth occurrence of PATTERN in a line with REPLACE.
:%s/\(\zsPATTERN.\{-}\)\{N}/REPLACE/

To replace the nth occurrence of PATTERN in a line in vim, in addtion to the above answer I just wanted to explain the pattern matching i.e how it is actually working for easy understanding.
So I will be discussing the \(.\{-}\zsPATTERN\)\{N} solution,
The example I will be using is replacing the second occurrence of more than 1 space in a sentence(string).
According to the pattern match code->
According to the zs doc,
\zs - Scroll the text horizontally to position the cursor at the start (left
side) of the screen.
.\{-} 0 or more as few as possible (*)
Here . is matching any character and {} the number of times.
e.g ab{2,3}c here it will match where b comes either 2 or 3 times.
In this case, we can also use .* which is 0 or many as many possible.
According to vim non-greedy docs, "{-}" is the same as "*" but uses the shortest match first algorithm.
\{N} -> Matches n of the preceding atom
/\<\d\{4}\> search for exactly 4 digits, same as /\<\d\d\d\d>
**ignore these \<\> they are for exact searching, like search for fred -> \<fred\> will only search fred not alfred.
\( \) combining the whole pattern.
PATTERN here is your pattern you are matching -> \s\{1,} (\s - space and {1,} as explained just above, search for 1 or more space)
"abc subtring def"
:%s/\(.\{-}\zs\s\{1,}\)\{2}/,/
OUTPUT -> "abc subtring,def"
# explanation: first space would be between abc and substring and second
# occurence of the pattern would be between substring and def, hence that
# will be replaced by the "," as specified in replace command above.

This answers your actual question, but not your intent.
You asked about replacing the nth occurrence of a word (but seemed to mean "within a line"). Here's an answer for the question as asked, in case someone finds it like I did =)
For weird tasks (like needing to replace every 12th occurrence of "dog" with "parrot"), I like to use recursive recordings.
First blank the recording in #q
qqq
Now start a new recording in q
qq
Next, manually do the thing you want to do (using the example above, replace the 12th occurrence of "dog" with "parrot"):
/dog
nnnnnnnnnnn
delete "dog" and get into insert
diwi
type parrot
parrot
Now play your currently empty "#q" recording
#q
which does nothing.
Finally, stop recording:
q
Now your recording in #q calls itself at the end. But because it calls the recording by name, it won't be empty anymore. So, call the recording:
#q
It will replay the recording, then at the end, as the last step, replay itself again. It will repeat this until the end of the file.
TLDR;
qq
q
/dog
nnnnnnnnnnndiwiparrot<esc>
#q
q
#q

Well, if you do /gc then you can count the number of times it asks you for confirmation, and go ahead with the replacement when you get to the nth :D

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js