From regular-expressions.info:
\b\w+(?<!s)\b. This is definitely not the same as \b\w+[^s]\b. When applied to Jon's, the former will match Jon and the latter Jon' (including the apostrophe). I will leave it up to you to figure out why. (Hint: \b matches between the apostrophe and the s). The latter will also not match single-letter words like "a" or "I".
Can you explain why?
Also, can you make clear what exactly \b does, and why it matches between the apostrophe and the s?
\b is a zero-width assertion that means word boundary. These character positions (taken from that link) are considered word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Word characters are of course any \w. s is a word character, but ' is not. In the above example, the area between the ' and the s is a word boundary.
The string "Jon's" looks like this if I highlight the anchors and boundaries (the first and last \bs occur in the same positions as ^ and $): ^Jon\b'\bs$
The negative lookbehind assertion (?<!s)\b means it will only match a word boundary if it's not preceded by the letter s (i.e. the last word character is not an s). So it looks for a word boundary under a certain condition.
Therefore the first regex works like this:
\b\w+ matches the first three letters J o n.
There's actually another word boundary between n and ' as shown above, so (?<!s)\b matches this word boundary because it's preceded by an n, not an s.
Since the end of the pattern has been reached, the resultant match is Jon.
The negated character class followed by a boundary, [^s]\b, matches any single character that is not the letter s, followed by a word boundary. Unlike the lookbehind above, which consumes nothing, this consumes one character and then requires a word boundary.
Therefore the second regex works like this:
\b\w+ matches the first three letters J o n.
Since the ' is not the letter s (it fulfills the character class [^s]), and it's followed by a word boundary (between ' and s), it's matched.
Since the end of the pattern has been reached, the resultant match is Jon'. The letter s is not matched because the word boundary before it has already been matched.
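To check the two walkthroughs above, here is a minimal sketch using Python's re module (my choice of flavor; the quoted article is not Python-specific). The variable names are only illustrative.

```python
import re

text = "Jon's"

# Positions where \b matches: start, between n and ', between ' and s, end.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# Lookbehind version: \w+ takes "Jon", then the boundary is checked in place.
lookbehind_match = re.search(r"\b\w+(?<!s)\b", text).group()

# Negated-class version: [^s] has to consume a character, here the apostrophe.
negated_class_match = re.search(r"\b\w+[^s]\b", text).group()

# Single-letter word: \w+ and [^s] each need a character of their own,
# so the negated-class version cannot match a lone "a".
single_lookbehind = re.search(r"\b\w+(?<!s)\b", "a")
single_negated = re.search(r"\b\w+[^s]\b", "a")

print(boundaries)           # [0, 3, 4, 5]
print(lookbehind_match)     # Jon
print(negated_class_match)  # Jon'
print(single_negated)       # None
```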
The example is trying to demonstrate that lookaheads and lookbehinds can be used to create "and" conditions.
\b\w+(?<!s)\b
could also be written as
\b\w*\w(?<!s)\b
That gives us
\b\w*[^s]\b vs \b\w*\w(?<!s)\b
I did that so we can ignore the irrelevant. (The \b are simply distractions in this example.) We have
[^s] vs \w(?<!s)
On the left, we can match any character except "s"
On the right, we can match any word character except "s"
By the way,
\w(?<!s)
could also be written
(?!s)\w # Not followed by "s" and followed by \w
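A small sketch of that comparison in Python's re module (an assumed flavor; the sample string is made up for illustration). It shows [^s] accepting the apostrophe while both lookaround forms accept only word characters other than s.

```python
import re

sample = "as's_x"

# Any character except "s" (the apostrophe qualifies).
any_but_s = re.findall(r"[^s]", sample)

# Any word character except "s", written with a lookbehind and a lookahead.
word_but_s_lookbehind = re.findall(r"\w(?<!s)", sample)
word_but_s_lookahead = re.findall(r"(?!s)\w", sample)

print(any_but_s)              # ['a', "'", '_', 'x']
print(word_but_s_lookbehind)  # ['a', '_', 'x']
print(word_but_s_lookahead)   # ['a', '_', 'x']
```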
Related
I need a regular expression to match the first word with character 'a' in it for each line. For example my test string is this:
bbsc abcd aaaagdhskss
dsaa asdd aaaagdfhdghd
wwer wwww awww wwwd
Only the ones in BOLD fonts should be matched. How can I do that? I can match all the words with 'a' in it, but can't figure out how to only match the first occurrence.
Assuming the text contains only word characters (i.e. \w characters) and spaces, use:
/^(?:[^a ]+ +)*([^a ]*a\w*)\b/gm
^ Matches the start of the line
(?:[^a ]+ +)* Matches, in a non-capturing group, zero or more words that contain no a, each followed by one or more spaces.
([^a ]*a\w*)\b Matches a word ending on a word boundary (it is already guaranteed to begin on a word boundary) that contains an a. The word-boundary constraint allows for the word to be at the end of the line.
The first word with an a in it will be in group #1.
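As a rough check, this is how the pattern could be run in Python (an assumption on my part; the question does not say which language is in use). The /gm flags are approximated with re.MULTILINE and findall, and FIRST_A_WORD is just an illustrative name.

```python
import re

# Same pattern as above; re.MULTILINE makes ^ match at the start of each line.
FIRST_A_WORD = re.compile(r"^(?:[^a ]+ +)*([^a ]*a\w*)\b", re.MULTILINE)

text = (
    "bbsc abcd aaaagdhskss\n"
    "dsaa asdd aaaagdfhdghd\n"
    "wwer wwww awww wwwd"
)

# With one capturing group, findall returns the contents of group 1.
first_words = FIRST_A_WORD.findall(text)
print(first_words)  # ['abcd', 'dsaa', 'awww']
```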
If we cannot assume that only word (\w) and white space characters are present, then use:
^(?:[^a ]+ +)*(\w*a\w*)\b
The difference is in scanning the first word with an a in it, (\w*a\w*), where we are guaranteed that we are scanning a string composed of only word characters.
What are you using? Many programs let you limit the number of matches. If yours does, use \b[b-z]*a[a-z]* with a limit of 1.
If that is not possible, capture the word in a group and let the rest of the pattern consume the remainder of the line: ([b-z]*a[a-z]*).*
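In Python, for instance (assumed here only as an example environment), the "limit of 1" idea corresponds to calling search once per line, since search stops at the first match. A minimal sketch with illustrative names:

```python
import re

pattern = re.compile(r"\b[b-z]*a[a-z]*")

lines = [
    "bbsc abcd aaaagdhskss",
    "dsaa asdd aaaagdfhdghd",
    "wwer wwww awww wwwd",
]

first_words = []
for line in lines:
    match = pattern.search(line)  # search returns only the first match
    first_words.append(match.group() if match else None)

print(first_words)  # ['abcd', 'dsaa', 'awww']
```

Note that this variant assumes lowercase ASCII words, as in the sample data.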
Try:
^(?:[^a ]+ )*(\w*a\w*) .*$
Basically what it says is: capture a bunch of words that are composed of anything but the letter a (or <space>) then capture a word that must include the letter a.
Group 1 should hold the first word with a.
I thought [^0-9a-zA-Z]* excludes all alpha-numeric letters, but allows for special characters, spaces, etc.
With the search string [^0-9a-zA-Z]*ELL[^0-9A-Z]* I expect outputs such as
ELL
ELLs
The ELL
Which ELLs
However, I also get the following outputs
Ellis Island
Bellis
How to correct this?
You may use
(?:\b|_)ELLs?(?=\b|_)
It will find ELL or ELLs if it is surrounded with _ or non-word chars, or at the start/end of the string.
Details:
(?:\b|_) - a non-capturing alternation group matching a word boundary position (\b) or (|) a _
ELLs? - matches ELL or ELLs since s? matches 1 or 0 s chars
(?=\b|_) - a positive lookahead that requires the presence of a word boundary or _ immediately to the right of the current location.
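A quick sketch of this pattern in Python's re module (the flavor is an assumption; find_ell and the sample strings are illustrative). The case-insensitive variant is included because the unwanted hits in the question (Ellis, Bellis) suggest the search was not case-sensitive.

```python
import re

ELL = re.compile(r"(?:\b|_)ELLs?(?=\b|_)")
ELL_IGNORECASE = re.compile(r"(?:\b|_)ELLs?(?=\b|_)", re.IGNORECASE)

def find_ell(text, pattern=ELL):
    """Return every match of the ELL pattern in text."""
    return pattern.findall(text)

print(find_ell("The ELL"))                       # ['ELL']
print(find_ell("Which ELLs"))                    # ['ELLs']
print(find_ell("Ellis Island", ELL_IGNORECASE))  # []
print(find_ell("Bellis", ELL_IGNORECASE))        # []
print(find_ell("my_ELL_var"))                    # ['_ELL']
```

The last case shows one consequence of the leading alternation: when the left delimiter is an underscore, it is consumed and becomes part of the match, whereas the right-hand side is only looked at.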
Change the * to +.
A * means any amount, including none. A + means one or more. What you probably want, though, is a word boundary:
\bELL\b
A word boundary is a position between \w and \W (a non-word character), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]). More here about that:
What is a word boundary in regexes?
I am trying to capture every word in a string except for 'and'. I also want to capture words that are surrounded by asterisks like *this*. The regex command I am using mostly works, but when it captures a word with asterisks, it will leave out the first one (so *this* would only have this* captured). Here is the regex I'm using:
/((?!and\b)\b[\w*]+)/gi
When I remove the last word boundary, it will capture all of *this* but won't leave out any of the 'and' s.
The problem is that * is not treated as a word character, so \b doesn't match at the position before it. I think you can replace it with:
^(?!and\b)([\w*]+)|((?!and\b)(?<=\W)[\w*]+)
The \b was replaced with a lookbehind for \W (a non-word character) so that a leading * is matched as well. However, the first word in the string would then not match, because it is not preceded by a non-word character. This is why I added the first alternative, anchored with ^.
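Here is a rough comparison of the two patterns in Python (assumed only for illustration; the /gi flags are approximated with re.IGNORECASE and finditer, and the sample sentence is made up).

```python
import re

# Pattern from the question: \b cannot match in front of a leading "*".
original = re.compile(r"((?!and\b)\b[\w*]+)", re.IGNORECASE)

# Suggested replacement: start of string, or preceded by a non-word character.
suggested = re.compile(
    r"^(?!and\b)([\w*]+)|((?!and\b)(?<=\W)[\w*]+)", re.IGNORECASE
)

text = "cats and *dogs* and band"

original_words = [m.group(0) for m in original.finditer(text)]
suggested_words = [m.group(0) for m in suggested.finditer(text)]

print(original_words)   # ['cats', 'dogs*', 'band']
print(suggested_words)  # ['cats', '*dogs*', 'band']
```

Since the suggested pattern has two alternatives with separate groups, the sketch reads the whole match via group(0) rather than a specific group number.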
Can anyone explain the difference between \b and \w regular expression metacharacters? It is my understanding that both these metacharacters are used for word boundaries. Apart from this, which meta character is efficient for multilingual content?
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w. Flavors showing "ascii" for word boundaries in the flavor comparison recognize only these as word characters.
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
\W is short for [^\w], the negated version of \w.
\w matches a word character. \b is a zero-width match: it matches a position that has a word character on one side and something that's not a word character on the other. (Examples of things that aren't word characters include whitespace, the beginning and end of the string, etc.)
\w matches a, b, c, d, e, and f in "abc def"
\b matches the (zero-width) position before a, after c, before d, and after f in "abc def"
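The same "abc def" example can be reproduced with a short Python sketch (Python's re is an assumption here; other flavors should agree for plain ASCII text). It also lists where \B matches, for contrast.

```python
import re

text = "abc def"

# \w consumes one word character per match.
word_chars = re.findall(r"\w", text)

# \b and \B consume nothing, so we record positions instead of text.
boundary_positions = [m.start() for m in re.finditer(r"\b", text)]
non_boundary_positions = [m.start() for m in re.finditer(r"\B", text)]

print(word_chars)              # ['a', 'b', 'c', 'd', 'e', 'f']
print(boundary_positions)      # [0, 3, 4, 7]
print(non_boundary_positions)  # [1, 2, 5, 6]
```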
See: http://www.regular-expressions.info/reference.html/
#Mahender, you probably meant the difference between \W (instead of \w) and \b. If not, then I would agree with #BoltClock and #jwismar above. Otherwise continue reading.
\W matches any non-word character, so it's tempting to use it to match word boundaries. The problem is that it will not match at the start or end of a line. \b is better suited for matching word boundaries, as it also matches at the start or end of a line. Roughly speaking (more experienced users can correct me here), \b can be thought of as (\W|^|$). [Edit: as #Ωmega mentions below, \b is a zero-length match, so (\W|^|$) is not strictly correct, but it hopefully helps explain the difference.]
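Quick example: for the string Hello World, \w+\W matches Hello together with the following space but does not match World, because no character follows it. \w+\b matches both Hello and World. (With .+ in place of \w+, the greedy .+\b would match the whole string Hello World in one go, while .+\W would still stop after the space.)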
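For anyone who wants to try this, a minimal sketch in Python's re module (assumed for illustration). It runs both the \w+ and the .+ forms so the effect of greediness is visible; note that .+\b returns the whole string as a single match.

```python
import re

text = "Hello World"

# \W must consume a character, so the last word is never matched.
with_nonword = re.findall(r"\w+\W", text)

# \b consumes nothing, so it also succeeds at the end of the string.
with_boundary = re.findall(r"\w+\b", text)

# The greedy .+ forms, for comparison.
dot_nonword = re.findall(r".+\W", text)
dot_boundary = re.findall(r".+\b", text)

print(with_nonword)   # ['Hello ']
print(with_boundary)  # ['Hello', 'World']
print(dot_nonword)    # ['Hello ']
print(dot_boundary)   # ['Hello World']
```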
\b <= this is a word boundary.
Matches at a position that is followed by a word character but not preceded by a word character, or that is preceded by a word character but not followed by a word character.
\w <= stands for "word character".
It always matches the ASCII characters [A-Za-z0-9_]
Is there anything specific you are trying to match?
Some useful regex websites for beginners, or just to whet your appetite:
http://www.regular-expressions.info
http://www.javascriptkit.com/javatutors/redev2.shtml
http://www.virtuosimedia.com/dev/php/37-tested-php-perl-and-javascript-regular-expressions
http://www.i-programmer.info/programming/javascript/4862-master-javascript-regular-expressions.html
I found this to be a very useful book:
Mastering Regular Expressions by Jeffrey E.F. Friedl
\w is not a word boundary, it matches any word character, including underscores: [a-zA-Z0-9_]. \b is a word boundary, that is, it matches the position between a word and a non-alphanumeric character: \W or [^\w].
These implementations may vary from language to language though.
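As one concrete case of that variation, and relevant to the multilingual part of the question: in Python 3 (used here purely as an example), \w on str patterns is Unicode-aware by default, and the re.ASCII flag restricts it to [A-Za-z0-9_]. A small sketch with a made-up sample string:

```python
import re

# "naïve café", written with escapes so the accented letters are single code points.
text = "na\u00efve caf\u00e9"

# Default for str patterns in Python 3: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, only [A-Za-z0-9_] count, so the words are split at the accents.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```

Because \b is defined in terms of \w, the same flag changes where word boundaries fall.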
How can I use regex for all words beginning with : punctuation?
This gets all words beginning with a:
\ba\w*\b
The minute I change the letter a to :, the whole thing fails. Am I supposed to escape the colon, and if so, how?
\b matches between a non-alphanumeric and an alphanumeric character, so if you place it before :, it only matches if there is a letter/digit right before the colon.
So you either need to drop the \b here or specify what exactly constitutes a boundary in this situation, for example:
(?<!\w):\w*\b
That would ensure that there is no letter/digit/underscore right before the :. Of course this presumes a regex flavor that supports lookbehind assertions.
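A short sketch of the difference in Python's re module, which does support lookbehind (the flavor and the sample text are my assumptions):

```python
import re

text = "see :foo and a:bar :baz"

# \b before the colon only matches when a word character precedes it.
with_leading_boundary = re.findall(r"\b:\w*\b", text)

# The negative lookbehind instead requires that no word character precedes it.
with_lookbehind = re.findall(r"(?<!\w):\w*\b", text)

print(with_leading_boundary)  # [':bar']
print(with_lookbehind)        # [':foo', ':baz']
```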
The problem is that \b won't match at the start of a word when the word starts with a colon, because the colon is not a word character. Try this:
(?<=:)\w*\b
This uses a (non-capturing) look-behind to assert that the previous character is a colon.
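A minimal Python sketch of this pattern (Python's re is assumed; the sample text is made up). Two things are worth noticing: the colon itself is not part of the match, and a colon in the middle of a token such as a:bar also qualifies.

```python
import re

text = "see :foo and a:bar :baz"

# The lookbehind checks for the colon without including it in the match.
words_after_colon = re.findall(r"(?<=:)\w*\b", text)

print(words_after_colon)  # ['foo', 'bar', 'baz']
```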