how to match non-word+word boundary in javascript regex - regex

how to match non-word+word boundary in javascript regex.
"This is, a beautiful island".match(/\bis,\b/)
In the above case, why doesn't the regex engine match up to is, and treat the following space as a word boundary, without moving any further?

\b asserts a position where a word character \w meets a non-word character \W or vice versa. Comma is a non-word character and space is as well. So \b never matches a position between a comma and a space.
Also, you forgot to put the ending delimiter in your regex.

You can use \B after the comma. \B matches wherever \b doesn't, and since neither the comma nor the space after it is a word character, the position between them is a non-word boundary.
console.log( "This is, a beautiful island".match(/\bis,\B/) )
//=> ["is,"]
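To see the two assertions side by side, here is a minimal sketch using the question's string (the variable names are only illustrative):

```javascript
const text = "This is, a beautiful island";

// \b after the comma needs a word character on one side of the position,
// but the comma is followed by a space, so the whole match fails.
const withWordBoundary = text.match(/\bis,\b/);

// \B matches between two non-word characters (comma and space).
const withNonWordBoundary = text.match(/\bis,\B/);

console.log(withWordBoundary);       // null
console.log(withNonWordBoundary[0]); // "is,"
```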

Related

Escape brackets in a regex with alternation

I am trying to write a Reg Expression to match any word from a list of words but am having trouble with words with brackets.
This is the reg expression I have so far:
^\b(?:Civil Services|Assets Management|Engineering Works (EW)|EW Maintenance|Ferry|Road Maintenance|Infrastructure Planning (IP)|Project Management Office (PMO)|Resource Recovery (RR)|Waste)\b$
Words without brackets such as Civil Services are matched but not words with brackets such as Engineering Works (EW).
I have tried single escaping with \ and double escaping with \\ but neither option seems to return a match when testing words with brackets in them.
How can I also match words with brackets?
The problem is that \b can't match a word boundary the way you want when it's preceded by a ). A word boundary is a word character adjacent to a non-word character or end-of-string. A word character is a letter, digit, or underscore; notably, ) is not a word character. That means that )\b won't match a parenthesis followed by a space, nor a parenthesis at the end of the string.
The easiest fix is to remove the \bs. You don't actually need them since you've already got ^ and $ anchors:
^(?:Orange|Banana|Apple \(Red\)|Apple \(Green\)|Plum|Mango)$
Alternatively, if you want to search in a larger string you could use a lookahead to look for a non-word character or end-of-string. This is essentially what \b does except we only look ahead, not behind.
\b(?:Orange|Banana|Apple \(Red\)|Apple \(Green\)|Plum|Mango)(?=\W|$)
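As a rough check of both points, the sketch below first tries one name from the question with and without the trailing \b, then runs the lookahead variant over a made-up sentence (the sentence and variable names are assumptions for illustration):

```javascript
// Question's case: \b after ")" cannot match at the end of the string.
const withBoundary = /^Engineering Works \(EW\)\b$/.test("Engineering Works (EW)");
const withoutBoundary = /^Engineering Works \(EW\)$/.test("Engineering Works (EW)");

// Lookahead variant for searching inside a larger string.
const fruit = /\b(?:Orange|Banana|Apple \(Red\)|Apple \(Green\)|Plum|Mango)(?=\W|$)/g;
const found = "I bought an Apple (Red), a Mango and some Plums.".match(fruit);

console.log(withBoundary, withoutBoundary); // false true
console.log(found);                         // ["Apple (Red)", "Mango"]
```

"Plums" is skipped because the lookahead sees the word character s right after Plum.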

regex : how to match word ending with parentheses ")"

I want to match string ending with ')' .
I use pattern :
"[)]\b" or ".*[)]\b"
It should match the string :
x=main2.addMenu('Edit')
But it doesn't work. What is wrong ?
The \b only matches a position at a word boundary. Think of it as (^\w|\w$|\W\w|\w\W), where \w is any alphanumeric character or underscore and \W is any other character. The parenthesis is a non-word character and nothing follows it in your string, so there is no word boundary after it for \b to match.
Just match a parenthesis, followed by the end of the string by using \)$
If you want to capture a string ending in ) (and not just find a trailing )), then you can use this in JS:
(.*?\)$)
(....) - captures the defined content;
.*? - matches anything up to the next element;
\)$ - a ) at the end of the string (needs to be escaped);
The \b word boundary is ambiguous: after a word character, it requires that the next character be a non-word one or the end of the string. When it stands after a non-word character (like )) it requires a word character (letter/digit/underscore) to appear right after it (not the end of the string here!).
So, there are three solutions:
Use \B (a non-word boundary): .*[)]\B, which will not match if the ) is followed by a word character
Use .*[)]$ with MULTILINE mode (add (?m) at the start of the pattern or add the /m modifier)
Emulate the multiline mode with an alternation group: .*[)](\r?\n|$)
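A minimal JavaScript sketch of these options, run against the question's string. The three-line input at the end is invented to show the effect of the m flag, and all variable names are illustrative:

```javascript
const line = "x=main2.addMenu('Edit')";

// \b after ")" needs a word character next, so this fails at end of string.
const boundaryAfterParen = /[)]\b/.test(line);

// Plain end-of-string anchor and the \B variant both succeed.
const parenAtEnd = /\)$/.test(line);
const nonBoundary = /.*[)]\B/.test(line);

// Capture the whole string that ends in ")".
const captured = line.match(/(.*?\)$)/)[1];

// With several lines, $ needs the m flag to match before each line break.
const perLine = "a = f()\nb = 2\nc = g()".match(/.*[)]$/gm);

console.log(boundaryAfterParen, parenAtEnd, nonBoundary); // false true true
console.log(captured); // x=main2.addMenu('Edit')
console.log(perLine);  // ["a = f()", "c = g()"]
```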

Difference between \w and \b regular expression meta characters

Can anyone explain the difference between \b and \w regular expression metacharacters? It is my understanding that both these metacharacters are used for word boundaries. Apart from this, which meta character is efficient for multilingual content?
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w. Flavors showing "ascii" for word boundaries in the flavor comparison recognize only these as word characters.
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
\W is short for [^\w], the negated version of \w.
\w matches a word character. \b is a zero-width match that matches a position that has a word character on one side, and something that's not a word character on the other. (Examples of things that aren't word characters include whitespace, beginning and end of the string, etc.)
\w matches a, b, c, d, e, and f in "abc def"
\b matches the (zero-width) position before a, after c, before d, and after f in "abc def"
See: http://www.regular-expressions.info/reference.html/
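The "abc def" example can be reproduced in JavaScript with a short sketch like this one (variable names are illustrative; matchAll assumes an ES2020-capable engine):

```javascript
const sample = "abc def";

// \w consumes one word character per match.
const wordChars = sample.match(/\w/g);

// \b consumes nothing, so collect the index of each zero-width match.
const boundaries = [...sample.matchAll(/\b/g)].map((m) => m.index);

console.log(wordChars);  // ["a", "b", "c", "d", "e", "f"]
console.log(boundaries); // [0, 3, 4, 7]
```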
@Mahender, you probably meant the difference between \W (instead of \w) and \b. If not, then I would agree with @BoltClock and @jwismar above. Otherwise continue reading.
\W would match any non-word character and so it's easy to try to use it to match word boundaries. The problem is that it will not match the start or end of a line. \b is more suited for matching word boundaries as it will also match the start or end of a line. Roughly speaking (more experienced users can correct me here) \b can be thought of as (\W|^|$). [Edit: as @Ωmega mentions below, \b is a zero-length match so (\W|^|$) is not strictly correct, but hopefully helps explain the diff]
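Quick example: For the string Hello World, .+\W matches "Hello " (including the trailing space) but cannot reach World, because no non-word character follows it. .+\b matches the whole of Hello World, since the end of the string after d counts as a word boundary.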
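Here is a small sketch of that Hello World comparison in JavaScript. The last pattern, \b\w+\b, is an extra added only to show each word being picked out separately:

```javascript
const greeting = "Hello World";

// \W has to consume a real character, so the match stops after the space.
const upToNonWord = greeting.match(/.+\W/)[0];

// \b is zero-width and also holds at the end of the string.
const upToBoundary = greeting.match(/.+\b/)[0];

// Whole-word search using boundaries on both sides.
const words = greeting.match(/\b\w+\b/g);

console.log(JSON.stringify(upToNonWord)); // "Hello "
console.log(upToBoundary);                // Hello World
console.log(words);                       // ["Hello", "World"]
```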
\b <= this is a word boundary.
Matches at a position that is followed by a word character but not preceded by a word character, or that is preceded by a word character but not followed by a word character.
\w <= stands for "word character".
It always matches the ASCII characters [A-Za-z0-9_]
Is there anything specific you are trying to match?
Some useful regex websites for beginners or just to whet your appetite.
http://www.regular-expressions.info
http://www.javascriptkit.com/javatutors/redev2.shtml
http://www.virtuosimedia.com/dev/php/37-tested-php-perl-and-javascript-regular-expressions
http://www.i-programmer.info/programming/javascript/4862-master-javascript-regular-expressions.html
I found this to be a very useful book:
Mastering Regular Expressions by Jeffrey E.F. Friedl
\w is not a word boundary, it matches any word character, including underscores: [a-zA-Z0-9_]. \b is a word boundary, that is, it matches the position between a word character and a non-word character: \W or [^\w].
These implementations may vary from language to language though.
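On the multilingual part of the question: in JavaScript specifically, \w and \b are based on the ASCII set [A-Za-z0-9_], so accented letters count as non-word characters. The sketch below shows that, along with one possible workaround using Unicode property escapes. It assumes an engine with the u flag and \p{...} support (ES2018), and the sample word is made up:

```javascript
const word = "caf\u00e9"; // "café"

// \w stops before the accented letter, and \b sees a boundary there.
const asciiOnly = word.match(/\b\w+\b/g);

// Letters and digits from any script, plus underscore.
const unicodeAware = word.match(/[\p{L}\p{N}_]+/gu);

console.log(asciiOnly);    // ["caf"]
console.log(unicodeAware); // ["café"]
```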

How can I use regex for all words beginning with : punctuation?

This gets all words beginning with a:
\ba\w*\b
The minute I change the letter a to :, the whole thing fails. Am I supposed to escape the colon, and if so, how?
\b matches between a non-alphanumeric and an alphanumeric character, so if you place it before :, it only matches if there is a letter/digit right before the colon.
So you either need to drop the \b here or specify what exactly constitutes a boundary in this situation, for example:
(?<!\w):\w*\b
That would ensure that there is no letter/digit/underscore right before the :. Of course this presumes a regex flavor that supports lookbehind assertions.
The problem is that \b won't match the start of a word when the word starts with a colon :, because colon is not a word character. Try this:
(?<=:)\w*\b
This uses a (non-capturing) look-behind to assert that the previous character is a colon.
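The two suggestions behave a little differently, as this sketch on an invented sample string shows (it assumes a JavaScript engine with lookbehind support, ES2018 or later):

```javascript
const post = "use :foo and :bar, but not a:b";

// Negative lookbehind: colon included, and not allowed right after a word character.
const withColon = post.match(/(?<!\w):\w*\b/g);

// Positive lookbehind: colon excluded, and any preceding colon qualifies.
const withoutColon = post.match(/(?<=:)\w*\b/g);

console.log(withColon);    // [":foo", ":bar"]
console.log(withoutColon); // ["foo", "bar", "b"]
```

The first form keeps the colon in the match and skips a:b. The second returns only the text after the colon and also picks up the b in a:b.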

Regex: word boundary but for white space, beginning of line or end of line only

I am looking for some word boundary to cover those 3 cases:
beginning of string
end of string
white space
Is there something like that since \b covers also -,/ etc.?
Would like to replace \b in this pattern by something described above:
(\b\d*\sx\s|\b\d*x|\b)
Try replacing \b with (?:^|\s|$)
That means
(
?: don't capture this group
^ match beginning of line
| or
\s match whitespace
| or
$ match end of line
)
Works for me in Python and JavaScript.
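A quick sketch of how this differs from \b, using an invented test string. Because \s consumes a character, the whitespace becomes part of the match, and two tokens separated by a single space cannot both match. The lookaround form (?<!\S)...(?!\S) is shown as one possible alternative and assumes lookbehind support (ES2018):

```javascript
const input = "cat bob-cat cat/dog";

// \b treats "-" and "/" as boundaries too: three hits.
const viaWordBoundary = input.match(/\bcat\b/g).length;

// Only whitespace or string edges count: one hit, trailing space included.
const viaWhitespace = input.match(/(?:^|\s)cat(?:\s|$)/g);

// The consumed space hides the second of two adjacent tokens...
const adjacentConsuming = "cat cat".match(/(?:^|\s)cat(?:\s|$)/g);
// ...while zero-width lookarounds find both.
const adjacentLookaround = "cat cat".match(/(?<!\S)cat(?!\S)/g);

console.log(viaWordBoundary);    // 3
console.log(viaWhitespace);      // ["cat "]
console.log(adjacentConsuming);  // ["cat "]
console.log(adjacentLookaround); // ["cat", "cat"]
```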
OK, so your real question is:
How do I match a unit, optionally preceded by a quantity, but only if there is either nothing or a space right before the match?
Use
(?<!\S)\b(?:\d+\s*x\s*)?\d+(?:\.\d+)?\s*ml\b
Explanation
(?<!\S): Assert that it's impossible to match a non-space character before the match.
\b: Match a word boundary
(?:\d+\s*x\s*)?: Optionally match a quantity followed by x (integers only)
\d+(?:\.\d+)?: Match a number (decimals optional)
\s*ml\b: Match ml, optionally preceded by whitespace.
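To try the pattern out, here is a small sketch with a made-up product label (the label text and variable names are assumptions; lookbehind requires an ES2018-capable engine):

```javascript
const unit = /(?<!\S)\b(?:\d+\s*x\s*)?\d+(?:\.\d+)?\s*ml\b/g;

const label = "Pack of 3 x 250 ml, refill 0.5ml, code-100ml";

// "100ml" is rejected because the "-" before it is a non-space character.
const amounts = label.match(unit);

console.log(amounts); // ["3 x 250 ml", "0.5ml"]
```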
Boundaries that you get with \b are not whitespace sensitive. They are complicated conditional assertions related to the transition between \w\W or \W\w. See this answer for how to write your anchor more precisely, so that you can deal with whitespace the way you want.