I'm learning regex via regexr.com so that I can be less embarrassingly pathetic when trying to match patterns.
The website provides an explanation for each component of the regex statement, but I'm unable to determine why this expression:
/([o])\w+/g
doesn't match any part of the word "to":
My understanding is that [o] should match the letter o and the \w switch (or whatever you'd call that... flag? option?) tells it to match words.
I would also benefit from an explanation of why it matches only o and the letters after o within a word (e.g. ome in the word Welcome) rather than the entire word containing the letter o.
Finally, the explanation of + tells me that it means to "match 1 or more of the preceding token", while toggling this seems to control whether only 1 letter after o is matched or all of the letters after o in the word are matched. Clarification on this would be greatly appreciated.
My apologies for the novice questions.
\w is not a switch, it's a character class for word characters. The exact meaning of \w depends on the system, but at the minimum it must match [A-Za-z0-9_]. In your example, the letter "o" in "to" is followed by a space, which is a non-word character. Since the + quantifier requires one or more word characters following "o", the word "to" does not match. For the same reason the match in Welcome is only ome: the pattern begins with [o], so nothing in it can match the letters before the o.
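To see this in action, here is a minimal sketch in Python. Using Python's re module is my assumption (regexr.com runs JavaScript regexes, but this particular pattern should behave the same way), and the sample sentence is made up for illustration.

```python
import re

text = "Welcome to regexr"  # hypothetical sample text

# "+" means one or more word characters must follow the "o".
with_plus = [m.group(0) for m in re.finditer(r"([o])\w+", text)]

# Without "+", exactly one word character must follow the "o".
without_plus = [m.group(0) for m in re.finditer(r"([o])\w", text)]

print(with_plus)     # ['ome']
print(without_plus)  # ['om']
```

The "o" in "to" is never matched in either case, because the character after it is a space and \w needs at least one word character there.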
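Actually, \b\w+\b would match the whole word to, but it matches every other whole word as well. To match to only as a standalone word (not the to in towards or toe), use \bto\b.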
\b is a word boundary, while \w matches any word character. \w+ matches at least one word character, unlimited times consecutively.
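A small sketch of what \b\w+\b picks out, assuming Python's re module and a made-up sample string:

```python
import re

text = "to toe towards"  # hypothetical sample text

# \w+ is greedy: it takes as many consecutive word characters as it can,
# so each match is a whole run of word characters.
whole_words = re.findall(r"\b\w+\b", text)

# A single \w takes exactly one word character per match.
single_chars = re.findall(r"\w", "to")

print(whole_words)   # ['to', 'toe', 'towards']
print(single_chars)  # ['t', 'o']
```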
RegexOne is a good starting point to learn regular expressions.
Related
So, I know how to find a string with a specified length and how to find a string that has a specified letter. But how can I find a string that matches both conditions? For example, I want to find a 4-letter string that has the letter "g".
What I did:
\b[A-Za-z].[Gg][A-Za-z].\b
This regex matches any word that has the letter "g". So now I need to limit the length, but when I try
\b([A-Za-z].[Gg][A-Za-z].){4}\b
it fails.
To match only ASCII-letter sequences with length of 4 containing a specific letter, you can use
\b(?=\w*[Gg])[a-zA-Z]{4}\b
The regex breakdown:
\b - a word boundary (we need the next letter to be a word character: [a-zA-Z0-9_], but we'll restrict it to [a-zA-Z] with the subsequent consuming pattern)
(?=\w*[Gg]) - a positive lookahead that makes sure there is at least one g or G in the word (\w* matches 0 or more word characters: letters, digits, or underscores)
[a-zA-Z]{4} - 4 ASCII letters
\b - trailing word boundary
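The breakdown above can be tried out with a short sketch. Python's re module and the sample words are my assumptions, not part of the original answer.

```python
import re

# Lookahead checks for a g/G somewhere in the word; {4} then fixes the length.
pattern = re.compile(r"\b(?=\w*[Gg])[a-zA-Z]{4}\b")

text = "gold frog bags tree going gig"  # hypothetical sample words

four_letter_g_words = pattern.findall(text)
print(four_letter_g_words)  # ['gold', 'frog', 'bags']
```

"tree" has no g, "going" has five letters, and "gig" has three, so none of them match.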
Already answered here by @Alan Moore.
You just have to adapt it:
(?<!\S)(?=[a-zA-Z]{4}(?!\S))\S*[gG]\S*
(?<!\S) matches a position that is not preceded by a non-whitespace character.
(?=[a-zA-Z]{4}(?!\S)) further asserts that the position is followed by exactly 4 letters.
Once the lookarounds are satisfied, \S*[gG]\S* goes ahead and consumes the string, assuming at least one of the characters is g or G.
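One practical difference between this whitespace-delimited version and a \b-based one shows up around punctuation. The sketch below is only an illustration, assuming Python's re module and a made-up sample string.

```python
import re

whitespace_delimited = re.compile(r"(?<!\S)(?=[a-zA-Z]{4}(?!\S))\S*[gG]\S*")
boundary_delimited = re.compile(r"\b(?=\w*[Gg])[a-zA-Z]{4}\b")

text = "gold, frog bags tree going"  # hypothetical sample text

# "gold," is rejected here: the comma is non-whitespace right after 4 letters.
by_whitespace = whitespace_delimited.findall(text)

# With \b the comma counts as a boundary, so "gold" is accepted.
by_boundary = boundary_delimited.findall(text)

print(by_whitespace)  # ['frog', 'bags']
print(by_boundary)    # ['gold', 'frog', 'bags']
```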
Can anyone explain the difference between \b and \w regular expression metacharacters? It is my understanding that both these metacharacters are used for word boundaries. Apart from this, which meta character is efficient for multilingual content?
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w. Flavors showing "ascii" for word boundaries in the flavor comparison recognize only these as word characters.
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
\W is short for [^\w], the negated version of \w.
\w matches a word character. \b is a zero-width match that matches a position that has a word character on one side, and something that's not a word character on the other. (Examples of things that aren't word characters include whitespace, beginning and end of the string, etc.)
\w matches a, b, c, d, e, and f in "abc def"
\b matches the (zero-width) position before a, after c, before d, and after f in "abc def"
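The two lines above can be checked directly. This is a small sketch assuming Python's re module; the positions are 0-based string offsets.

```python
import re

text = "abc def"

# \w consumes one word character per match.
word_chars = re.findall(r"\w", text)

# \b consumes nothing; every match is an empty string at a boundary position.
boundary_positions = [m.start() for m in re.finditer(r"\b", text)]

print(word_chars)          # ['a', 'b', 'c', 'd', 'e', 'f']
print(boundary_positions)  # [0, 3, 4, 7]
```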
See: http://www.regular-expressions.info/reference.html/
@Mahender, you probably meant the difference between \W (instead of \w) and \b. If not, then I would agree with @BoltClock and @jwismar above. Otherwise continue reading.
\W would match any non-word character, and so it's easy to try to use it to match word boundaries. The problem is that it will not match the start or end of a line. \b is more suited for matching word boundaries as it will also match the start or end of a line. Roughly speaking (more experienced users can correct me here) \b can be thought of as (\W|^|$). [Edit: as @Ωmega mentions below, \b is a zero-length match so (\W|^|$) is not strictly correct, but hopefully helps explain the diff]
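Quick example: for the string Hello World, \w+\W would match Hello_ (with the space, shown here as _) but will not match World, because no character follows it. \w+\b would match both Hello and World. (With .+ in place of \w+, the greedy dot runs across the space, so .+\b matches the whole string Hello World in one go.)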
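A quick way to compare \W and \b on this string, sketched with Python's re module (my choice of tool). The sketch uses \w+ to isolate the words, since a greedy .+ would run across the space.

```python
import re

text = "Hello World"

# \W must consume a real non-word character after the word.
with_nonword = re.findall(r"\w+\W", text)

# \b only needs a boundary position, which also exists at the end of the string.
with_boundary = re.findall(r"\w+\b", text)

# For comparison: a greedy dot swallows the space as well.
greedy_dot = re.findall(r".+\b", text)

print(with_nonword)   # ['Hello ']
print(with_boundary)  # ['Hello', 'World']
print(greedy_dot)     # ['Hello World']
```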
\b <= this is a word boundary.
Matches at a position that is followed by a word character but not preceded by a word character, or that is preceded by a word character but not followed by a word character.
\w <= stands for "word character".
It always matches the ASCII characters [A-Za-z0-9_]
Is there anything specific you are trying to match?
Some useful regex websites for beginners or just to whet your appetite.
http://www.regular-expressions.info
http://www.javascriptkit.com/javatutors/redev2.shtml
http://www.virtuosimedia.com/dev/php/37-tested-php-perl-and-javascript-regular-expressions
http://www.i-programmer.info/programming/javascript/4862-master-javascript-regular-expressions.html
I found this to be a very useful book:
Mastering Regular Expressions by Jeffrey E.F. Friedl
\w is not a word boundary; it matches any word character, including underscores: [a-zA-Z0-9_]. \b is a word boundary, that is, it matches the position between a word character and a non-word character (\W or [^\w]), or between a word character and the start or end of the string.
These implementations may vary from language to language though.
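As one concrete case of that variation, and relevant to the multilingual part of the question: in Python 3 (my assumption for this sketch), \w on str patterns is Unicode-aware by default, and the re.ASCII flag narrows it to [a-zA-Z0-9_]. Other languages and engines may behave differently.

```python
import re

text = "caf\u00e9 na\u00efve"  # "café naïve", written with escapes

# Default for str patterns in Python 3: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)  # ['café', 'naïve']
print(ascii_words)    # ['caf', 'na', 've']
```

Since \b is defined in terms of \w, the same flag also changes where word boundaries fall.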
So I want to find the string "to" in a string, but only when it is standalone. It could be at the beginning of the string, as in "to do this", so I can't search " to ".
What I want to do is say, if there is a character behind "to", it cannot be \w. How do I do that?
Try word boundaries. A \b on each side requires the searched pattern to begin and end at a word boundary:
\bto\b
This is exactly what you want to say, i.e.:
So what exactly is it that \b matches? Regular expression engines do not understand English, or any language for that matter, and so they don't know what word boundaries are. \b simply matches a location between characters that are usually parts of words (alphanumeric characters and underscore, text that would be matched by \w) and anything else (text that would be matched by \W).
Sams Teach Yourself Regular Expressions in 10 Minutes
By Ben Forta
Try using \bto\b, which will match to as a stand-alone word
Here's a good explanation:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
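Here is a short sketch of the standalone "to" search, assuming Python's re module and made-up sample strings. It also spells out the asker's own wording ("the character next to it cannot be \w") with lookarounds, which gives the same result.

```python
import re

at_start = "to do this"
mixed = "go to town, toe to toe"  # hypothetical sample text

# Searching for " to " misses a "to" at the very start of the string.
space_search = re.search(r" to ", at_start)

# \bto\b finds it, because the start of the string counts as a boundary.
boundary_search = re.search(r"\bto\b", at_start)

# Only the standalone words are found, not the "to" in "town" or "toe".
standalone = re.findall(r"\bto\b", mixed)

# The same condition written out: no word character before or after.
lookaround = re.findall(r"(?<!\w)to(?!\w)", mixed)

print(space_search)            # None
print(boundary_search.span())  # (0, 2)
print(standalone)              # ['to', 'to']
print(lookaround)              # ['to', 'to']
```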
If I have a sentence and I wish to display the word (or all the words) that follow a particular word, how do I do that? For example, I would like to display the word fox after brown in The quick brown fox jumps over the lazy dog. I know I can use a positive lookbehind, e.g. (?<=brown\s)(\w+); however, I don't quite understand the use of \b in (?<=\bbrown\s)(\w+). I am using http://gskinner.com/RegExr/ as my tester.
\b is a zero width assertion. That means it does not match a character, it matches a position with one thing on the left side and another thing on the right side.
The word boundary \b matches on a change from \w (a word character) to \W (a non-word character), or from \W to \w.
Which characters are included in \w depends on your language. At a minimum it contains all ASCII letters, all ASCII digits, and the underscore. If your regex engine supports Unicode, \w may contain every character that has the Unicode property letter or number.
\W are all characters, that are NOT in \w.
\bbrown\s
will match here
The quick brown fox
         ^^
but not here
The quick bbbbrown fox
because there is no word boundary between b and brown, i.e. no change from a non-word character to a word character; both characters are included in \w.
When the regex reaches \b, it looks at the next character, which is the b from brown. Now \b knows what is on the right side: a word character (the b). It then has to look back: for \b to be true, there must be a non-word character before the b. If there is a space (which is not in \w), then the \b before the b is true. But if there is another b, then it is false, and \bbrown does not match "bbrown".
The regex brown would match both strings "quick brown" and "bbrown", whereas the regex \bbrown matches only "quick brown" and not "bbrown".
For more details see here on www.regular-expressions.info
The \b token is kind of special. It doesn't actually match a character. What it does is it matches any position that lies at the boundary of a word (where "word" in this case is anything that matches \w). So the pattern (?<=brown\s)(\w+) would match "bbbbrown fox", but (?<=\bbrown\s)(\w+) wouldn't, since the position between "bb" and "brown" is in the middle of a word, not at its boundary.
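The difference between the two lookbehinds can be sketched as follows. Python's re module is my assumption (the question uses RegExr); both lookbehinds are fixed-width, which Python requires.

```python
import re

good = "The quick brown fox"
bad = "The quick bbbbrown fox"

without_boundary = r"(?<=brown\s)(\w+)"
with_boundary = r"(?<=\bbrown\s)(\w+)"

# Without \b, "brown " at the end of "bbbbrown " still satisfies the lookbehind.
loose_good = re.findall(without_boundary, good)
loose_bad = re.findall(without_boundary, bad)

# With \b, "brown" has to start at a word boundary.
strict_good = re.findall(with_boundary, good)
strict_bad = re.findall(with_boundary, bad)

print(loose_good, loose_bad)    # ['fox'] ['fox']
print(strict_good, strict_bad)  # ['fox'] []
```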
\b is a "word boundary" and is the position between the start or end of a word and then "non-word" characters.
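Its main use is to simplify the selection of a whole word. \bbrown\s will match when brown starts the string or follows a non-word character such as a space:
^brown
 brown
It will not match when brown directly follows a digit or an underscore, because both are word characters:
99brown
_brown
It's more or less equivalent to "\W*" except when "capturing" strings, as "\b" matches the start of the word rather than the non-word character preceding or following the word.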
\b is a zero width match of a word boundary.
(Either start or end of a word, where "word" is defined as \w+)
Note: "zero width" means if the \b is within a regex that matches, it does not add any characters to the text captured by that match. I.e. the regex \bfoo\b when matched will capture just "foo": although the \b contributed to the way that foo was matched (i.e. as a whole word), it didn't contribute any characters.
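A tiny sketch of the "zero width" point, assuming Python's re module and made-up input strings:

```python
import re

m = re.search(r"\bfoo\b", "a foo b")

matched_text = m.group(0)  # the \b anchors add no characters to the match
matched_span = m.span()    # start and end offsets of just "foo"

# Inside a longer word there is no boundary after "foo", so nothing matches.
inside_word = re.search(r"\bfoo\b", "foobar")

print(matched_text)  # foo
print(matched_span)  # (2, 5)
print(inside_word)   # None
```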
A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. It's equivalent to this:
(?<=\w)(?!\w)|(?=\w)(?<!\w)
...or it's supposed to be. See this question for everything you ever wanted to know about word boundaries. ;)
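For one engine at least, the equivalence can be checked by comparing match positions. This sketch assumes Python's re module and an arbitrary test string; other engines may differ, which is the point of the caveat above.

```python
import re

text = "Jon's book, _x9!"  # hypothetical test string

builtin = [m.start() for m in re.finditer(r"\b", text)]
spelled_out = [
    m.start()
    for m in re.finditer(r"(?<=\w)(?!\w)|(?=\w)(?<!\w)", text)
]

# Both patterns report the same zero-width positions in Python.
print(builtin == spelled_out)  # True
```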
\b guarantees that brown is on a word boundary effectively excluding patterns like
blackandbrown
You don't need a look behind, you can simply use:
(\bbrown\s)(\w+)
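With this version the word after brown ends up in the second capture group. A minimal sketch, assuming Python's re module:

```python
import re

pattern = re.compile(r"(\bbrown\s)(\w+)")

m = pattern.search("The quick brown fox jumps over the lazy dog")
next_word = m.group(2)  # group 1 is "brown " itself, group 2 the following word

# No boundary before "brown" here, so the pattern does not match at all.
no_match = pattern.search("blackandbrown fox")

print(next_word)  # fox
print(no_match)   # None
```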
From regular-expressions.info:
\b\w+(?<!s)\b. This is definitely not the same as \b\w+[^s]\b. When applied to Jon's, the former will match Jon and the latter Jon' (including the apostrophe). I will leave it up to you to figure out why. (Hint: \b matches between the apostrophe and the s). The latter will also not match single-letter words like "a" or "I".
Can you explain why?
Also, can you make clear what exactly \b does, and why it matches between the apostrophe and the s?
\b is a zero-width assertion that means word boundary. These character positions (taken from that link) are considered word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Word characters are of course any \w. s is a word character, but ' is not. In the above example, the area between the ' and the s is a word boundary.
The string "Jon's" looks like this if I highlight the anchors and boundaries (the first and last \bs occur in the same positions as ^ and $): ^Jon\b'\bs$
The negative lookbehind assertion (?<!s)\b means it will only match a word boundary if it's not preceded by the letter s (i.e. the last word character is not an s). So it looks for a word boundary under a certain condition.
Therefore the first regex works like this:
\b\w+ matches the first three letters J o n.
There's actually another word boundary between n and ' as shown above, so (?<!s)\b matches this word boundary because it's preceded by an n, not an s.
Since the end of the pattern has been reached, the resultant match is Jon.
The complementary character class [^s]\b means it will match any character that is not the letter s, followed by a word boundary. Unlike the above, this looks for one character followed by a word boundary.
Therefore the second regex works like this:
\b\w+ matches the first three letters J o n.
Since the ' is not the letter s (it fulfills the character class [^s]), and it's followed by a word boundary (between ' and s), it's matched.
Since the end of the pattern has been reached, the resultant match is Jon'. The letter s is not matched because the pattern ends at the word boundary before it.
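Both walkthroughs can be reproduced with a short sketch. Python's re module is my assumption; the single-letter case from the quoted passage is included as well.

```python
import re

lookbehind_version = re.compile(r"\b\w+(?<!s)\b")
char_class_version = re.compile(r"\b\w+[^s]\b")

# The lookbehind only checks the previous character; it consumes nothing.
first = lookbehind_version.search("Jon's").group(0)

# [^s] consumes one extra character, here the apostrophe.
second = char_class_version.search("Jon's").group(0)

# A single-letter word: \w+ takes the letter, and [^s] has nothing left to consume.
single_lookbehind = lookbehind_version.search("a")
single_char_class = char_class_version.search("a")

print(first)                       # Jon
print(second)                      # Jon'
print(single_lookbehind.group(0))  # a
print(single_char_class)           # None
```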
The example is trying to demonstrate that lookaheads and lookbehinds can be used to create "and" conditions.
\b\w+(?<!s)\b
could also be written as
\b\w*\w(?<!s)\b
That gives us
\b\w*[^s]\b vs \b\w*\w(?<!s)\b
I did that so we can ignore the irrelevant. (The \b are simply distractions in this example.) We have
[^s] vs \w(?<!s)
On the left, we can match any character except "s"
On the right, we can match any word character except "s"
By the way,
\w(?<!s)
could also be written
(?!s)\w # Not followed by "s" and followed by \w
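The three forms can be compared side by side. This is a minimal sketch assuming Python's re module and a made-up input string.

```python
import re

text = "as-is"  # hypothetical sample text

# Any character except "s", including the hyphen.
not_s = re.findall(r"[^s]", text)

# Any word character except "s": both lookaround spellings agree.
word_not_s_behind = re.findall(r"\w(?<!s)", text)
word_not_s_ahead = re.findall(r"(?!s)\w", text)

print(not_s)              # ['a', '-', 'i']
print(word_not_s_behind)  # ['a', 'i']
print(word_not_s_ahead)   # ['a', 'i']
```

The lookaround adds a second condition on the same character, which is the "and" behaviour the example is getting at.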