How does \w+ select whole word? - regex

In a simple regular expression, I understand that
\w
gives a single word character; however, I do not get how adding a plus (+) like so:
\w+
selects the whole word. In my mind, the plus just means one or more of the word character, so I do not understand how it would expand out to whole words.

Similar to how [0-9]+ means one or more digits, where each digit may be different, \w+ means one or more word characters, again where each character may be different. In this normal "greedy" mode it keeps on going until it can't find any more. (You can also make it non-greedy, finding as few as possible while still allowing the regex to match, via \w+? in some regex flavors.)
If you wanted what you expect, to require the same character repeatedly, you'd need to use back-references:
(\w)\1* - Find any word character, capture it, and then find zero or more of that same character.
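A minimal sketch of the difference, assuming a JavaScript engine; the sample string is made up purely for illustration:

```javascript
// Hypothetical sample input, chosen only for illustration.
const text = "aaabbbc 2024";

// \w+ : a run of word characters, each of which may differ.
const run = text.match(/\w+/)[0];

// (\w)\1* : one word character, then zero or more copies of that same character.
const repeated = text.match(/(\w)\1*/)[0];

// [0-9]+ : a run of digits, each of which may differ.
const digits = text.match(/[0-9]+/)[0];

console.log(run);      // aaabbbc
console.log(repeated); // aaa
console.log(digits);   // 2024
```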
One character at a time example
With the regex \w+ and the input string Hello World the regex will start at the beginning and say to itself, "Is the next character matched by \w? Yes, so we add it to the result and then move forward one character." Because of the + modifier, after doing this once it keeps on doing it, one step at a time, until it cannot find any more. At this point it moves on to the next part of the regular expression (if there is one) or it stops. With just \w+ this captures all of Hello but not the space or World.
A note on Backtracking
The default + modifier enables "backtracking". This is a (sometimes-expensive) feature of regex that allows you to express your desire simply while giving the best chance of succeeding. For example, if your regex was \w+l and your input string was Hello World, the regex engine would capture all of Hello, and then say "Oh dear, now I need to find an l. There isn't one after the o...maybe I went too far?" It will back up until it has captured Hell and see if there is an l next (there isn't), and then back up again to just Hel and see if there is an l next (there is). The end result will be capturing just the string Hell.
Even more interesting is the case of the regex \w+r and the input string "Hello World". In this case the engine will capture all of Hello and see if there is an r following it. Since it does not find one, it backtracks one character at a time until it finds that even H alone isn't followed by an r. At this point it says "Maybe starting with the H wasn't a good idea" and moves forward in the string to try later starting positions. Eventually it finds World, then backtracks to capture just Wo and finds that there is, finally, the r it needs. At this point it returns Wor as the match.
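Both backtracking cases can be checked directly. This is a small sketch assuming JavaScript; the variable names are only illustrative:

```javascript
const input = "Hello World";

// \w+ first grabs "Hello", then gives characters back until an l can follow.
const endsWithL = input.match(/\w+l/);

// No start position inside "Hello" works, so the match begins at "W" (index 6).
const endsWithR = input.match(/\w+r/);

console.log(endsWithL[0], endsWithL.index); // Hell 0
console.log(endsWithR[0], endsWithR.index); // Wor 6
```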

Adding the + matches one or more of the preceding token.
It's called a greedy match: it will match as many characters as possible while still allowing the next token to match.
http://regexr.com is a great tool for using regex and it also explains what the operators do.

The + is a greedy quantifier. It means that it will match as many characters as possible, even if there are "lesser" matches that would be valid.
In the string Hello world, \w+ matches Hello and world.
Appending a ? to it makes it non-greedy and it will be satisfied with the minimal valid match.
In the string Hello world, \w+? matches every letter separately.
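To see the two behaviours side by side, here is a short sketch assuming JavaScript and the global flag, so that every match is collected:

```javascript
const input = "Hello world";

// Greedy: each match extends as far as word characters continue.
const greedy = input.match(/\w+/g);

// Lazy: nothing follows the quantifier, so one character is already enough.
const lazy = input.match(/\w+?/g);

console.log(greedy); // [ 'Hello', 'world' ]
console.log(lazy.length); // 10, one match per letter
```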

Related

Why put a lookahead at the beginning, or a lookbehind at the end? [duplicate]

I'm pretty decent with regular expressions, and now I'm trying once again to understand lookahead and lookbehind assertions. They mostly make sense, but I'm not quite sure how the order affects the result. I've been looking at this site which places lookbehinds before the expression, and lookaheads after the expression. My question is, does this change anything? A recent answer here on SO placed the lookahead before the expression which is leading to my confusion.
When tutorials introduce lookarounds, they tend to choose the simplest use case for each one. So they'll use examples like (?<!a)b ('b' not preceded by 'a') or q(?=u) ('q' followed by 'u'). It's just to avoid cluttering the explanation with distracting details, but it tends to create (or reinforce) the impression that lookbehinds and lookaheads are supposed to appear in a certain order. It took me quite a while to get over that idea, and I've seen several others afflicted with it, too.
Try looking at some more realistic examples. One question that comes up a lot involves validating passwords; for example, making sure a new password is at least six characters long and contains at least one letter and one digit. One way to do that would be:
^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$
The character class [A-Za-z0-9]{6,} could match all letters or all digits, so you use the lookaheads to ensure that there's at least one of each. In this case, you have to do the lookaheads first, because the later parts of the regex have to be able to examine the whole string.
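As a rough check of that password pattern, assuming JavaScript; the sample passwords below are invented for illustration:

```javascript
const passwordPattern = /^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$/;

// Hypothetical sample passwords.
const samples = ["abc123", "abcdef", "123456", "ab12", "abc 123"];

// Both lookaheads run at position 0 and scan the whole string before
// [A-Za-z0-9]{6,} consumes anything.
const verdicts = samples.map((s) => passwordPattern.test(s));

console.log(verdicts); // [ true, false, false, false, false ]
```

Only the first sample has a letter, a digit, six or more characters, and nothing outside the allowed class.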
For another example, suppose you need to find all occurrences of the word "there" unless it's preceded by a quotation mark. The obvious regex for that is (?<!")[Tt]here\b, but if you're searching a large corpus, that could create a performance problem. As written, that regex will do the negative lookbehind at each and every position in the text, and only when that succeeds will it check the rest of the regex.
Every regex engine has its own strengths and weaknesses, but one thing that's true of all of them is that they're quicker to find fixed sequences of literal characters than anything else--the longer the sequence, the better. That means it can be dramatically faster to do the lookbehind last, even though it means matching the word twice:
[Tt]here\b(?<!"[Tt]here)
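The two placements should find the same occurrences; only the order of work differs. A small sketch assuming a JavaScript engine with lookbehind support (ES2018 or later) and a made-up sentence; it checks equivalence only and makes no performance measurement:

```javascript
// Hypothetical sample text.
const text = 'He said "there" and then went there. There it was.';

// Lookbehind first: the assertion is tried at every position.
const lookbehindFirst = text.match(/(?<!")[Tt]here\b/g);

// Lookbehind last: the literal word is found first, then re-checked.
const lookbehindLast = text.match(/[Tt]here\b(?<!"[Tt]here)/g);

console.log(lookbehindFirst); // [ 'there', 'There' ]
console.log(lookbehindLast);  // [ 'there', 'There' ]
```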
So the rule governing the placement of lookarounds is that there is no rule; you put them wherever they make the most sense in each case.
It's easier to show in an example than explain, I think. Let's take this regex:
(?<=\d)(?=(.)\1)(?!p)\w(?<!q)
What this means is:
(?<=\d) - make sure what comes before the match position is a digit.
(?=(.)\1) - make sure whatever character we match at this (same) position is followed by a copy of itself (through the backreference).
(?!p) - make sure what follows is not a p.
\w - match a letter, digit or underscore. Note that this is the first time we actually match and consume the character.
(?<!q) - make sure what we matched so far doesn't end with a q.
All this will match strings like abc5ddx or 9xx but not 5d or 6qq or asd6pp or add. Note that each assertion works independently. It just stops, looks around, and if all is well, allows the matching to continue.
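The listed strings can be checked mechanically. A minimal sketch, assuming a JavaScript engine that supports lookbehind (ES2018 or later):

```javascript
const pattern = /(?<=\d)(?=(.)\1)(?!p)\w(?<!q)/;

const shouldMatch = ["abc5ddx", "9xx"];
const shouldNotMatch = ["5d", "6qq", "asd6pp", "add"];

const matched = shouldMatch.map((s) => pattern.test(s));
const rejected = shouldNotMatch.map((s) => pattern.test(s));

console.log(matched);  // [ true, true ]
console.log(rejected); // [ false, false, false, false ]
```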
Note also that in many implementations (PCRE and Python's re module, for example), lookbehinds have the limitation of being fixed-length. You can't use repetition/optionality operators like ?, *, and + in them. This is because to match a pattern we need a starting point - otherwise we'd have to try matching each lookbehind from every point in the string. Some engines, such as .NET and JavaScript (ES2018 and later), do allow variable-length lookbehinds.
A sample run of this regex on the string a3b5ddx is as follows:
Text cursor position: 0.
Try to match the first lookbehind at position -1 (since \d always matches 1 character). We can't match at negative indices, so fail and advance the cursor.
Text cursor position: 1.
Try to match the first lookbehind at position 0. a does not match \d so fail and advance the cursor again.
Text cursor position: 2.
Try to match the first lookbehind at position 1. 3 does match \d so keep the cursor intact and continue matching.
Try to match the first lookahead at position 2. b matches (.) and is captured. 5 does not match \1 (which is the captured b). Therefore, fail and advance the cursor.
Text cursor position: 3.
Try to match the first lookbehind at position 2. b does not match \d so fail and advance the cursor again.
Text cursor position: 4.
Try to match the first lookbehind at position 3. 5 does match \d so keep the cursor intact and continue matching.
Try to match the first lookahead at position 4. d matches (.) and is captured. The second d does match \1 (which is the first captured d). Allow the matching to continue from where we left off.
Try to match the second lookahead. d at position 4 does not match p, and since this is a negative lookahead, that's what we want; allow the matching to continue.
Try to match \w at position 4. d matches. Advance cursor since we have consumed a character and continue. Also mark this as the start of the match.
Text cursor position: 5.
Try to match the second lookbehind at position 4 (since q always matches 1 character). d does not match q which is what we want from a negative lookbehind.
Realize that we're at the end of the regex and report success by returning the substring from the start of the match to the current position (4 to 5), which is d.
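The outcome of that walk-through can be confirmed with exec, which reports where the match starts. A sketch assuming JavaScript (ES2018 or later for the lookbehinds):

```javascript
const pattern = /(?<=\d)(?=(.)\1)(?!p)\w(?<!q)/;
const result = pattern.exec("a3b5ddx");

// result[0] is the consumed text, result.index is where it starts,
// and result[1] is the character captured inside the lookahead.
console.log(result[0], result.index, result[1]); // d 4 d
```

Only one character is consumed even though four assertions were evaluated around it.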
1(?=ABC) means - look for 1, then check that ABC follows it, without consuming ABC or including it in the match.
(?<=ABC)1 means - check that ABC comes before the current location, without including it in the match, and continue to match 1.
So, normally, you'll place the lookahead after the expression and the lookbehind before it.
When we place a lookbehind after the expression, we're rechecking the string we've already matched. This is common when you have complex conditions (you can think about it as the AND of regexes). For example, take a look at this recent answer by Daniel Brückner:
.&.(?<! & )
First, you capture an ampersand between two characters. Next, you check that they were not both spaces (\S&\S would not work here, the OP wanted to capture 1&_).
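A short sketch of how the trailing lookbehind behaves, assuming JavaScript (ES2018 or later); the sample strings are invented for illustration:

```javascript
const pattern = /.&.(?<! & )/;

// Hypothetical inputs: the last one has a space on both sides of the ampersand.
const samples = ["1&_", "1& ", " &2", "a & b"];

const results = samples.map((s) => {
  const m = s.match(pattern);
  return m ? m[0] : null;
});

console.log(results); // [ '1&_', '1& ', ' &2', null ]
```

The lookbehind re-reads the three characters just matched and rejects them only when they are exactly space, ampersand, space.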

Regex to find last occurrence of pattern in a string

My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?
In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches any zero or more chars, as many as possible. (?s) and (?s:.*) can be used with regex engines that support these constructs so that . matches any char, including line breaks.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.
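Both the specific pattern and the general scheme can be tried out as follows. This is a sketch assuming JavaScript; the second input, "a.b.c", is a made-up example for the general form pattern(?![\s\S]*pattern):

```javascript
const input = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM";

// Whitespace that is not followed, later on, by any more whitespace.
const lastWhitespace = input.match(/\s+(?!\S*\s)/);

// General scheme with a literal dot as the pattern.
const lastDot = "a.b.c".match(/\.(?![\s\S]*\.)/);

console.log(lastWhitespace.index); // 40
console.log(lastDot.index);        // 3
```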
You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3
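For readers who prefer to run it locally, a minimal sketch assuming JavaScript:

```javascript
const input = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM";

// The lookahead requires a dot, then only non-dot characters up to the end.
const found = input.match(/(\s+)(?=\.[^.]+$)/);

console.log(JSON.stringify(found[1]), found.index); // " " 40
```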
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and finds everything as long as the needles are found. Thus, it also captures all needle occurrences until the very last needle.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensure that previously found everything is followed by the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
as we aren't interested in the remainder, we match it and don't use it any further. We could capture it with parentheses and use it as another group, but that's out of scope here
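The replacements above can be reproduced with a sketch like the following, assuming JavaScript (ES2018 or later). Note that JavaScript writes the group reference in the replacement string as $1 rather than \1:

```javascript
const haystack =
  "as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM";

// Single needle: the last run of whitespace.
const lastWhitespace = haystack.replace(/.*(?=((?<=\S)\s+)).*/, ">$1<");

// Several needles: whichever one occurs last wins.
const needles = /.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^.])\.+)).*/;
const lastNeedle = haystack.replace(needles, ">$1<");

// Lazy leading .*? stops at the first needle instead.
const firstNeedleRe = /.*?(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^.])\.+)).*/;
const firstNeedle = haystack.replace(firstNeedleRe, ">$1<");

console.log(lastWhitespace); // > <
console.log(lastNeedle);     // >..<
console.log(firstNeedle);    // >.<
```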
SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Often this is a file extension (e.g. .c, .ts, or .json), or a top-level domain (e.g. .com, .org, or .io), but it could be something as arbitrary as MC Donald's Mulan Szechuan Sauce. The point is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. It's a variance because the amount of white-space that needs to be matched varies (or has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Often when working with regular expressions it's necessary to work in reverse.
We will start at the end of the problem described above and work backwards; that is, we are going to start at The Before Variance (or #3).
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespace with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAScript (JavaScript) regex literals, the / characters act like the quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" let us check what comes next in the string without consuming any characters, so the checked text does not become part of the match.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the insensitive flag to tell the engine not to be concerned with the case of the letters; in other words, the regular expression /\s+(?=\.cOM)/gi
is the same as /\s+(?=\.Com)/gi, and both are the same as /\s+(?=\.com)/gi and /\s+(?=\.COM)/gi. Every one of the "Just Listed" regular expressions is equivalent so long as the i flag is set.
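A quick way to confirm the equivalence, sketched in JavaScript with the question's string (the dot is escaped in every variant so that it only matches a literal period):

```javascript
const input = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM";

const variants = [
  /\s+(?=\.cOM)/gi,
  /\s+(?=\.Com)/gi,
  /\s+(?=\.com)/gi,
  /\s+(?=\.COM)/gi,
];

// Collect the start index of every match for each variant.
const indexes = variants.map((re) => [...input.matchAll(re)].map((m) => m.index));

console.log(indexes); // [ [ 40 ], [ 40 ], [ 40 ], [ 40 ] ]
```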
That's it! The link HERE (REGEX101) will take you to an example where you can play with the regular expression if you like.
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex, is even though it matches .txt, .org, .json, and .unclepetespurplebeet, the regex isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see at the LINK HERE (Regex101) there are 3 lines in the string. Those lines represent areas where the sub-string's lookahead assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Though only one match was returned in the end, and it was the correct match, when implemented in a program or website that's running in production, you can pretty much guarantee that the regex is not only going to fail, but it's going to fail horribly and you will come to hate it.
You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$
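Unlike the lookahead versions, this pattern consumes the trailing text, so the whitespace has to be read from the capture group. A sketch assuming JavaScript:

```javascript
const input = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM";
const found = input.match(/(\s+)\.[^.]*$/);

// found[0] is the whole match; found[1] is only the whitespace segment.
console.log(JSON.stringify(found[0])); // " .COM"
console.log(JSON.stringify(found[1])); // " "
```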

Alternation usage creates strange behavior

I am using this regex to catch the "e"s at the end of a string.
e\b|e[!?.:;]
It works but the thing I don't understand, when this encounters an input like
"space."
It only takes the "e", not including the "." but the regex has [!?.:;], which suggests it should capture the dot also.
If I remove the e\b| in the beginning, it captures the dot too. This is no problem for me because I was already trying to capture the letter only, however, I need this behavior to be explained.
The regex engine stops searching as soon as it finds a valid match.
The order of the alternatives matters: since the left alternative, e\b, matches first, the engine never tries the right side of the alternation.
In your case, the regex engine starts at the first character in "space."; it doesn't match. Then it moves to the second one, the "p". It still doesn't match. It keeps trying each position until it finally reaches the "e" and matches the left side of the alternation. When this happens, it doesn't proceed, since a match was found.
I highly advise you to go through this tutorial, it gives a very good explanation on that.
If you need to make sure the . is returned in the match, just swap the alternatives:
e[!?.:;]|e\b
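The effect of the ordering can be seen in a two-line sketch, assuming JavaScript:

```javascript
const input = "space.";

// Left alternative e\b succeeds first, so the dot is never consumed.
const original = input.match(/e\b|e[!?.:;]/)[0];

// With the alternatives swapped, the punctuation branch is tried first.
const swapped = input.match(/e[!?.:;]|e\b/)[0];

console.log(original); // e
console.log(swapped);  // e.
```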
In NFA regex, the first alternative matched wins. There are also some different aspects here to consider, too, but this is out of scope here.
More details can be found here:
Why regex engine choose to match pattern ..X from .X|..X|X.?
Lazy quantifier {,}? not working as I would expect
In this case, here is what is going on: \b after e requires that e not be followed by a word character (a non-word character or the end of the string both qualify). Since . is a non-word character, it satisfies the condition. Both alternatives are able to match at that location, so e\b, being the first alternative branch, wins over e[!?.:;].

Why does this regex using *+ (possessive) not match

I need to get the last match of [0-9.]* in a string like
one 1.234 three
some text 1.2321 xyz 1 5 1.234 and more text
some other text
but also need the text around it - even when there is no number like in the 3rd line
I wanted to use ^(.*)([0-9\.]*+)(.*)$ but it just matches the first (.*).
On the other hand ^(.*?)([0-9\.]*+)(.*?)$ just matches the last (.*?).
Why is that? I thought that it will try to satisfy all rules?
I know that I can exclude 0-9. from the last .* to get what I want, but I want to understand why the above isn't working although I used *+
A possessive quantifier doesn't guarantee the longest possible match, it just prevents backtracking. Neither of your regexes ever tries to backtrack, so the possessive quantifier has no effect.
With the first regex, the first (.*) consumes the whole string, then ([0-9.]*+) and the second (.*) each consume nothing because there's nothing left to match.
With the second regex, the first (.*?) initially consumes nothing because it's reluctant. Then ([0-9.]*+) successfully matches some more nothing because it's still at the beginning of the string, which doesn't happen to start with a digit or a period. Finally, the last (.*?) is forced to consume what's left (the whole string) despite being reluctant, because of the anchor ($) following it.
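The group contents described above can be inspected with a sketch like this one. It assumes JavaScript, which has no possessive quantifiers, so plain * stands in for *+; since neither regex needs to backtrack into the middle group on this input, the substitution should not change the result:

```javascript
const line = "one 1.234 three";

// Greedy first group swallows everything; the other groups match nothing.
const greedyGroups = line.match(/^(.*)([0-9.]*)(.*)$/).slice(1);

// Reluctant groups: the first two match nothing at position 0,
// and the last is forced to take the whole line because of $.
const lazyGroups = line.match(/^(.*?)([0-9.]*)(.*?)$/).slice(1);

console.log(greedyGroups); // [ 'one 1.234 three', '', '' ]
console.log(lazyGroups);   // [ '', '', 'one 1.234 three' ]
```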
To solve your problem, we need to know more about the kind of input you can expect. For example, if you know there will never be any digits or periods after the number you're looking for, you could use this:
^(.*?)(?:([0-9.]+)([^0-9.]*))?$
The key here is that the second capturing group, ([0-9.]+), uses a + instead of a *. If there are no digits or periods in the string, the enclosing group, (?:([0-9.]+)([^0-9.]*))?, will match nothing, and the initial (.*?) will be forced to consume the whole string. (The second and third groups will be empty.)
If there's more than one sequence of digits or periods in the string, the second group is guaranteed to match the last of them, because the third group, ([^0-9.]*), allows anything but those characters in the remainder of the string.
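Applied to the three sample lines from the question, the suggested regex behaves as follows. This is a sketch assuming JavaScript, where groups that did not participate in the match come back as undefined:

```javascript
const pattern = /^(.*?)(?:([0-9.]+)([^0-9.]*))?$/;

const lines = [
  "one 1.234 three",
  "some text 1.2321 xyz 1 5 1.234 and more text",
  "some other text",
];

// For each line: [text before, last number, text after].
const parsed = lines.map((line) => line.match(pattern).slice(1));

console.log(parsed[0]); // [ 'one ', '1.234', ' three' ]
console.log(parsed[1]); // [ 'some text 1.2321 xyz 1 5 ', '1.234', ' and more text' ]
console.log(parsed[2]); // [ 'some other text', undefined, undefined ]
```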
This is pretty weak, but it's the best I can do with the information you've supplied. The point is, possessive quantifiers are brilliant when you can use them, but that doesn't happen nearly as often as you might expect.
