Can anyone explain the difference between the \b and \w regular expression metacharacters? It is my understanding that both of these metacharacters are used for word boundaries. Apart from this, which metacharacter is more efficient for multilingual content?
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w. Flavors showing "ascii" for word boundaries in the flavor comparison recognize only these as word characters.
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
\W is short for [^\w], the negated version of \w.
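To make the \B and \W description concrete, here is a minimal JavaScript sketch. The sample strings and the helper name positionsOf are my own, not part of the answer above.

```javascript
// Hypothetical helper: index of every zero-length match of a global regex.
const positionsOf = (re, s) => [...s.matchAll(re)].map(m => m.index);

// \B holds wherever \b does not: inside "abc" and inside "def".
const nonBoundaries = positionsOf(/\B/g, "abc def");

// \B also holds between two non-word characters (the comma and the space).
const betweenNonWord = positionsOf(/\B/g, "a, b");

// \W consumes one non-word character, exactly like [^\w].
const nonWordChars = "a_1 b-c!".match(/\W/g);

console.log(nonBoundaries);   // [ 1, 2, 5, 6 ]
console.log(betweenNonWord);  // [ 2 ]
console.log(nonWordChars);    // [ ' ', '-', '!' ]
```

Note that the underscore and the digit in "a_1" are not reported by \W, since both are word characters.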
\w matches a word character. \b is a zero-width match that matches a position that has a word character on one side, and something that's not a word character on the other. (Examples of things that aren't word characters include whitespace, the beginning and end of the string, etc.)
\w matches a, b, c, d, e, and f in "abc def"
\b matches the (zero-width) position before a, after c, before d, and after f in "abc def"
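The same "abc def" example can be checked in JavaScript. This is only a sketch; the variable names are mine.

```javascript
const sample = "abc def";

// Each \w match consumes exactly one word character.
const wordChars = sample.match(/\w/g);

// \b consumes nothing, so replacing it only inserts the marker at each boundary.
const marked = sample.replace(/\b/g, "|");

console.log(wordChars);  // [ 'a', 'b', 'c', 'd', 'e', 'f' ]
console.log(marked);     // |abc| |def|
```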
See: http://www.regular-expressions.info/reference.html/
@Mahender, you probably meant the difference between \W (instead of \w) and \b. If not, then I would agree with @BoltClock and @jwismar above. Otherwise continue reading.
\W matches any non-word character, so it's tempting to use it to match word boundaries. The problem is that it will not match at the start or end of a line. \b is better suited to matching word boundaries because it also matches at the start or end of a line. Roughly speaking (more experienced users can correct me here), \b can be thought of as (\W|^|$). [Edit: as @Ωmega mentions below, \b is a zero-length match, so (\W|^|$) is not strictly correct, but hopefully it helps explain the difference.]
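Quick example: for the string Hello World, .+\W would match Hello_ (where _ stands for the space) but cannot extend to World, because no non-word character follows it. .+\b matches all of Hello World, because \b also matches at the end of the string; written as \w+\b, it finds both Hello and World.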
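Here is the Hello World example as a runnable JavaScript sketch. The \w+ variants are my own addition, included to show the per-word behaviour.

```javascript
const text = "Hello World";

// \W has to consume a real character, so the match stops after the space.
const withNonWord = text.match(/.+\W/)[0];

// \b is zero-width and also holds at the end of the string.
const withBoundary = text.match(/.+\b/)[0];

// Per-word versions: only "Hello" has a non-word character after it.
const wordsViaW = text.match(/\w+\W/g);
const wordsViaB = text.match(/\w+\b/g);

console.log(JSON.stringify(withNonWord));   // "Hello "
console.log(JSON.stringify(withBoundary));  // "Hello World"
console.log(wordsViaW);                     // [ 'Hello ' ]
console.log(wordsViaB);                     // [ 'Hello', 'World' ]
```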
\b <= this is a word boundary.
Matches at a position that is followed by a word character but not preceded by a word character, or that is preceded by a word character but not followed by a word character.
\w <= stands for "word character".
It always matches the ASCII characters [A-Za-z0-9_]
Is there anything specific you are trying to match?
Some useful regex websites for beginners, or just to whet your appetite:
http://www.regular-expressions.info
http://www.javascriptkit.com/javatutors/redev2.shtml
http://www.virtuosimedia.com/dev/php/37-tested-php-perl-and-javascript-regular-expressions
http://www.i-programmer.info/programming/javascript/4862-master-javascript-regular-expressions.html
I found this to be a very useful book:
Mastering Regular Expressions by Jeffrey E.F. Friedl
\w is not a word boundary; it matches any word character, including underscores: [a-zA-Z0-9_]. \b is a word boundary, that is, it matches the position between a word character and a non-word character (\W or [^\w]), or the start or end of the string next to a word character.
These implementations may vary from language to language though.
Related
how to match non-word+word boundary in javascript regex.
"This is, a beautiful island".match(/\bis,\b/)
In the above case, why doesn't the regex engine match up to is, and treat the space as a word boundary, without moving further?
\b asserts a position where a word character \w meets a non-word character \W or vice versa. Comma is a non-word character and space is as well. So \b never matches a position between a comma and a space.
Also you forgot to put ending delimiter in your regex.
You can use \B after comma that matches where \b doesn't since comma is not considered a word character.
console.log( "This is, a beautiful island".match(/\bis,\B/) )
//=> ["is,"]
So I want to find the string "to" in a string, but only when it is standalone. It could be at the beginning of the string, as in "to do this", so I can't search " to ".
What I want to do is say, if there is a character behind "to", it cannot be \w. How do I do that?
Try word boundaries. \b matches at the beginning and at the end of a word, so wrapping the search term in it gives a standalone match:
\bto\b
This is exactly what you want to say, i.e.:
So what exactly is it that \b matches? Regular expression engines do not understand English, or any language for that matter, and so they don't know what word boundaries are. \b simply matches a location between characters that are usually parts of words (alphanumeric characters and underscore, text that would be matched by \w) and anything else (text that would be matched by \W).
Sams Teach Yourself Regular Expressions in 10 Minutes
By Ben Forta
Try using \bto\b, which will match to as a stand-alone word
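A quick JavaScript check of \bto\b, as a sketch. The sample strings are made up for illustration.

```javascript
const standaloneTo = /\bto\b/;

const samples = [
  "to do this",      // "to" at the very start of the string
  "going to town",   // "to" surrounded by spaces
  "tomato",          // "to" only occurs inside a longer word
  "into the woods",  // "to" is preceded by a word character
  "what to?",        // "to" followed by punctuation
];

// Keep only the strings that contain "to" as a standalone word.
const hits = samples.filter(s => standaloneTo.test(s));

console.log(hits);  // [ 'to do this', 'going to town', 'what to?' ]
```

No leading or trailing space is needed in the pattern, which is why the start-of-string case works.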
Here's a good explanation:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search
using a regular expression in the form of \bword\b. A "word
character" is a character that can be used to form words. All
characters that are not "word characters" are "non-word characters".
Given a sentence, I want to display the word (or all the words) that follow a particular matched word. For example, in The quick brown fox jumps over the lazy dog I would like to display fox, the word after brown. I know I can use a positive lookbehind, e.g. (?<=brown\s)(\w+), but I don't quite understand the use of \b in (?<=\bbrown\s)(\w+). I am using http://gskinner.com/RegExr/ as my tester.
\b is a zero width assertion. That means it does not match a character, it matches a position with one thing on the left side and another thing on the right side.
The word boundary \b matches on a change from a \w (a word character) to a \W (a non-word character), or from a \W to a \w.
Which characters are included in \w depends on your language. At a minimum it contains all ASCII letters, all ASCII digits and the underscore. If your regex engine supports Unicode, \w may also include every character that has the Unicode property Letter or Number.
\W matches every character that is NOT in \w.
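For multilingual text this difference matters. As far as I know, in JavaScript \w and \b stay ASCII-only even with the u flag, so a Unicode-aware version has to be spelled out with property escapes. The sketch below assumes an engine with \p{...} and lookbehind support (recent Node or browsers); the helper name wholeWord is hypothetical, and it assumes the word contains no regex metacharacters.

```javascript
const sampleText = "h\u00e9llo w\u00f6rld";  // "héllo wörld", written with escapes

// ASCII-only \w: the accented letters split each word in two.
const asciiWords = sampleText.match(/\w+/g);

// Unicode letters and digits plus underscore, via property escapes (needs the u flag).
const unicodeWords = sampleText.match(/[\p{L}\p{N}_]+/gu);

// Hypothetical Unicode-aware "whole word" test built from lookarounds instead of \b.
const wholeWord = (word) =>
  new RegExp(`(?<![\\p{L}\\p{N}_])${word}(?![\\p{L}\\p{N}_])`, "u");

const plainB = /\bllo\b/.test(sampleText);           // \b sees a boundary between é and l
const unicodeB = wholeWord("llo").test(sampleText);  // é counts as a letter, so no match

console.log(asciiWords);        // [ 'h', 'llo', 'w', 'rld' ]
console.log(unicodeWords);      // [ 'héllo', 'wörld' ]
console.log(plainB, unicodeB);  // true false
```

Other flavors behave differently (some make \w Unicode-aware by default or via a flag), so check the documentation for the engine you actually use.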
\bbrown\s
will match here
The quick brown fox
         ^^
but not here
The quick bbbbrown fox
because there is no word boundary between b and brown, i.e. no change from a non-word character to a word character; both characters are included in \w.
When the regex engine reaches a \b, it looks at the character to the right, here the b of brown. Now the \b knows what is on its right side: a word character, the b. It then has to look back: for the \b to be true, there must be a non-word character (or the start of the string) before the b. If there is a space (which is not in \w), the \b before the b is true. But if there is another b, it is false, and \bbrown does not match "bbrown".
The regex brown would match both strings, "quick brown" and "bbrown", whereas the regex \bbrown matches only "quick brown" and not "bbrown".
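The brown versus \bbrown comparison is easy to verify. A minimal JavaScript sketch, using the two strings from the answer:

```javascript
const phrases = ["quick brown", "bbrown"];

// Without \b, "brown" is found anywhere, even inside "bbrown".
const plain = phrases.map(s => /brown/.test(s));

// With \b, a non-word character (or the start of the string) must precede the b.
const anchored = phrases.map(s => /\bbrown/.test(s));

console.log(plain);     // [ true, true ]
console.log(anchored);  // [ true, false ]
```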
For more details see here on www.regular-expressions.info
The \b token is kind of special. It doesn't actually match a character. What it does is it matches any position that lies at the boundary of a word (where "word" in this case is anything that matches \w). So the pattern (?<=brown\s)(\w+) would match "bbbbrown fox", but (?<=\bbrown\s)(\w+) wouldn't, since the position between "bb" and "brown" is in the middle of a word, not at its boundary.
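Here is a small JavaScript sketch of the two lookbehind patterns side by side. It assumes an engine with lookbehind support; the helper name wordAfter is mine.

```javascript
// Hypothetical helper: return capture group 1, or null when there is no match.
const wordAfter = (re, s) => {
  const m = s.match(re);
  return m ? m[1] : null;
};

const loose = /(?<=brown\s)(\w+)/;     // "brown" may be the tail of a longer word
const strict = /(?<=\bbrown\s)(\w+)/;  // "brown" must start at a word boundary

console.log(wordAfter(loose, "The quick brown fox"));      // fox
console.log(wordAfter(strict, "The quick brown fox"));     // fox
console.log(wordAfter(loose, "The quick bbbbrown fox"));   // fox
console.log(wordAfter(strict, "The quick bbbbrown fox"));  // null
```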
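\b is a "word boundary": the position between the start or end of a word and the adjacent "non-word" characters.
Its main use is to simplify the selection of a whole word. \bbrown\s will match brown (followed by whitespace) in:
^brown (brown at the start of the string)
 brown (brown after a space)
but not in:
99brown
_brown
because digits and the underscore are word characters, so there is no boundary before the b.
Very roughly, it plays the role of "\W*", except when "capturing" strings: "\b" matches at the start of the word rather than consuming the non-word character preceding or following the word.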
\b is a zero width match of a word boundary.
(Either the start or the end of a word, where "word" is defined as \w+)
Note: "zero width" means that if the \b is within a regex that matches, it does not add any characters to the text captured by that match. I.e. the regex \bfoo\b, when matched, will capture just "foo": although the \b contributed to the way that foo was matched (i.e. as a whole word), it didn't contribute any characters.
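To see the "zero width" point in practice, here is a tiny JavaScript sketch with sample strings of my own:

```javascript
const fooMatch = "a foo b".match(/\bfoo\b/);

console.log(fooMatch[0]);         // foo
console.log(fooMatch[0].length);  // 3, the two \b add no characters
console.log(fooMatch.index);      // 2

// The \b still constrain where the match may occur: no whole word "foo" here.
const noMatch = "foofoo".match(/\bfoo\b/);
console.log(noMatch);             // null
```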
A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. It's equivalent to this:
(?<=\w)(?!\w)|(?=\w)(?<!\w)
...or it's supposed to be. See this question for everything you ever wanted to know about word boundaries. ;)
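The claimed equivalence can at least be spot-checked. The JavaScript sketch below compares the positions found by \b with those found by the lookaround version on a few strings of my own choosing. It only shows agreement for these inputs in a JavaScript engine with lookbehind support; it says nothing about other flavors.

```javascript
const boundary = /\b/g;
const emulated = /(?<=\w)(?!\w)|(?=\w)(?<!\w)/g;

// Hypothetical helper: index of every zero-length match of a global regex.
const indexes = (re, s) => [...s.matchAll(re)].map(m => m.index);

const probes = ["Jon's", "abc def", "a, b", "", "_x-1!"];

// True when both patterns report exactly the same positions for every probe.
const agree = probes.every(
  s => JSON.stringify(indexes(boundary, s)) === JSON.stringify(indexes(emulated, s))
);

console.log(indexes(boundary, "Jon's"));  // [ 0, 3, 4, 5 ]
console.log(agree);                       // true
```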
\b guarantees that brown starts on a word boundary, effectively excluding patterns like
blackandbrown
You don't need a look behind, you can simply use:
(\bbrown\s)(\w+)
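A sketch of the capture-group version in JavaScript, using the sentence from the question. Unlike the lookbehind form, the overall match here includes "brown " as well, so the wanted word has to be read from group 2.

```javascript
const sentence = "The quick brown fox jumps over the lazy dog";
const groups = sentence.match(/(\bbrown\s)(\w+)/);

console.log(JSON.stringify(groups[0]));  // "brown fox"  (whole match)
console.log(JSON.stringify(groups[1]));  // "brown "     (group 1)
console.log(JSON.stringify(groups[2]));  // "fox"        (group 2, the word after brown)
```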
From regular-expressions.info:
\b\w+(?<!s)\b. This is definitely not the same as \b\w+[^s]\b. When applied to Jon's, the former will match Jon and the latter Jon' (including the apostrophe). I will leave it up to you to figure out why. (Hint: \b matches between the apostrophe and the s). The latter will also not match single-letter words like "a" or "I".
Can you explain why?
Also, can you make clear what exactly \b does, and why it matches between the apostrophe and the s?
\b is a zero-width assertion that means word boundary. These character positions (taken from that link) are considered word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Word characters are of course any \w. s is a word character, but ' is not. In the above example, the area between the ' and the s is a word boundary.
The string "Jon's" looks like this if I highlight the anchors and boundaries (the first and last \bs occur in the same positions as ^ and $): ^Jon\b'\bs$
The negative lookbehind assertion (?<!s)\b means it will only match a word boundary if it's not preceded by the letter s (i.e. the last word character is not an s). So it looks for a word boundary under a certain condition.
Therefore the first regex works like this:
\b\w+ matches the first three letters J o n.
There's actually another word boundary between n and ' as shown above, so (?<!s)\b matches this word boundary because it's preceded by an n, not an s.
Since the end of the pattern has been reached, the resultant match is Jon.
The complementary character class [^s]\b means it will match any character that is not the letter s, followed by a word boundary. Unlike the above, this looks for one character followed by a word boundary.
Therefore the second regex works like this:
\b\w+ matches the first three letters J o n.
Since the ' is not the letter s (it fulfills the character class [^s]), and it's followed by a word boundary (between ' and s), it's matched.
Since the end of the pattern has been reached, the resultant match is Jon'. The letter s is not matched because the word boundary before it has already been matched.
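Both walkthroughs can be reproduced in JavaScript. This is a sketch that assumes lookbehind support; the helper name firstMatch is mine.

```javascript
const lookbehindForm = /\b\w+(?<!s)\b/;    // a word that does not end in s
const negatedClassForm = /\b\w+[^s]\b/;    // word chars, then one extra non-s character

// Hypothetical helper: the matched text, or null when nothing matches.
const firstMatch = (re, s) => {
  const m = s.match(re);
  return m ? m[0] : null;
};

console.log(firstMatch(lookbehindForm, "Jon's"));    // Jon
console.log(firstMatch(negatedClassForm, "Jon's"));  // Jon'

// Single-letter words: [^s] needs a second character to consume, the lookbehind does not.
console.log(firstMatch(lookbehindForm, "a"));        // a
console.log(firstMatch(negatedClassForm, "a"));      // null
```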
The example is trying to demonstrate that lookaheads and lookbehinds can be used to create "and" conditions.
\b\w+(?<!s)\b
could also be written as
\b\w*\w(?<!s)\b
That gives us
\b\w*[^s]\b vs \b\w*\w(?<!s)\b
I did that so we can ignore the irrelevant. (The \b are simply distractions in this example.) We have
[^s] vs \w(?<!s)
On the left, we can match any character except "s"
On the right, we can match any word character except "s"
By the way,
\w(?<!s)
could also be written
(?!s)\w # The next character is not "s", and it is a word character
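The three single-character forms can be compared directly. In this JavaScript sketch each pattern is wrapped in ^...$ so that it must match the whole one-character string; the test characters are my own choice.

```javascript
const classForm = /^[^s]$/;       // any character except s
const behindForm = /^\w(?<!s)$/;  // a word character that is not s
const aheadForm = /^(?!s)\w$/;    // same condition, checked before consuming

const chars = ["a", "s", "'", "_"];

// One row per character: [classForm, behindForm, aheadForm].
const table = chars.map(c => [classForm.test(c), behindForm.test(c), aheadForm.test(c)]);

console.log(table);
// a -> [ true,  true,  true  ]
// s -> [ false, false, false ]
// ' -> [ true,  false, false ]   the apostrophe is "not s" but not a word character
// _ -> [ true,  true,  true  ]
```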
How can I use regex for all words beginning with : punctuation?
This gets all words beginning with a:
\ba\w*\b
The minute I change the letter a to :, the whole thing fails. Am I supposed to escape the colon, and if so, how?
\b matches between a non-alphanumeric and an alphanumeric character, so if you place it before :, it only matches if there is a letter/digit right before the colon.
So you either need to drop the \b here or specify what exactly constitutes a boundary in this situation, for example:
(?<!\w):\w*\b
That would ensure that there is no letter/digit/underscore right before the :. Of course this presumes a regex flavor that supports lookbehind assertions.
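A short JavaScript sketch of the difference, on a test string I made up. It assumes lookbehind support.

```javascript
const tagged = "a :foo b:bar :baz";

// \b before the colon only holds when a word character precedes it.
const withB = tagged.match(/\b:\w*\b/g);

// The negative lookbehind accepts a colon that is NOT preceded by a word character.
const withLookbehind = tagged.match(/(?<!\w):\w*\b/g);

console.log(withB);           // [ ':bar' ]
console.log(withLookbehind);  // [ ':foo', ':baz' ]
```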
The problem is that \b won't match the start of a word when the word starts with a colon :, because colon is not a word character. Try this:
(?<=:)\w*\b
This uses a (non-capturing) look-behind to assert that the previous character is a colon.
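For comparison, a sketch of this pattern in JavaScript on a sample string of my own. Two things to note: the colon itself is not part of the match, and nothing is checked about what comes before the colon.

```javascript
const line = "a :foo b:bar :baz";

// Matches the word characters that directly follow a colon.
const names = line.match(/(?<=:)\w*\b/g);

console.log(names);  // [ 'foo', 'bar', 'baz' ]
```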