About Struts2 Email Validator - regex

Struts2 has a perfect email validator. Its regex for a single email address is below:
\\b(^[_A-Za-z0-9-](\\.[_A-Za-z0-9-])*@([A-Za-z0-9-])+((\\.com)|(\\.net)|(\\.org)|(\\.info)|(\\.edu)|(\\.mil)|(\\.gov)|(\\.biz)|(\\.ws)|(\\.us)|(\\.tv)|(\\.cc)|(\\.aero)|(\\.arpa)|(\\.coop)|(\\.int)|(\\.jobs)|(\\.museum)|(\\.name)|(\\.pro)|(\\.travel)|(\\.nato)|(\\..{2,3})|(\\..{2,3}\\..{2,3}))$)\\b
It is so long because it validates TLDs. But just look at the start and end of it.
My question is about the wrapping \\b. What does it mean to put \\b at the start and end of a regex (even before ^ and after $)?

This is about Word Boundaries:
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
You can read more here: http://www.regular-expressions.info/wordboundaries.html
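A minimal sketch of those boundary positions, assuming Python's re module (the sample strings are made up for illustration and are not from the question):

```python
import re

# Every position where \b matches in a sample string.
text = "cat, dog"
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)  # [0, 3, 5, 8]

# "Whole words only" search: only the standalone "cat" is found,
# not the "cat" inside "concat" or "cats".
sample = "cat concat cats"
whole_word_hits = [m.start() for m in re.finditer(r"\bcat\b", sample)]
print(whole_word_hits)  # [0]
```

Positions 0 and 8 are the start and end of the string, and 3 and 5 sit between a word character and a non-word character.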

They appear to be superfluous, perhaps remnants of an earlier version of the regex.

Related

Escape brackets in a regex with alternation

I am trying to write a Reg Expression to match any word from a list of words but am having trouble with words with brackets.
This is the reg expression I have so far:
^\b(?:Civil Services|Assets Management|Engineering Works (EW)|EW Maintenance|Ferry|Road Maintenance|Infrastructure Planning (IP)|Project Management Office (PMO)|Resource Recovery (RR)|Waste)\b$
Words without brackets such as Civil Services are matched, but words with brackets such as Engineering Works (EW) are not.
I have tried single escaping with \ and double escaping with \\, but neither option seems to return a match when testing words with brackets in them.
How can I also match words with brackets?
The problem is that \b can't match a word boundary the way you want when it's preceded by a ). A word boundary is a word character adjacent to a non-word character or end-of-string. A word character is a letter, digit, or underscore; notably, ) is not a word character. That means that )\b won't match a parenthesis followed by a space, nor a parenthesis at the end of the string.
The easiest fix is to remove the \bs. You don't actually need them since you've already got ^ and $ anchors:
^(?:Orange|Banana|Apple \(Red\)| Apple \(Green\)|Plum|Mango)$
Alternatively, if you want to search in a larger string, you could use a lookahead to look for a non-word character or end-of-string. This is essentially what \b does, except we only look ahead, not behind.
\b(?:Orange|Banana|Apple \(Red\)| Apple \(Green\)|Plum|Mango)(?=\W|$)
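Here is a rough sketch of both fixes in Python, using a shortened version of the list from the question. The variable names and the sample sentence are my own, and re.escape is just one convenient way to escape the brackets:

```python
import re

names = ["Civil Services", "Engineering Works (EW)", "Waste"]
# re.escape backslash-escapes the parentheses so they match literally.
alternation = "|".join(re.escape(name) for name in names)

with_boundaries = re.compile(r"^\b(?:" + alternation + r")\b$")
anchors_only = re.compile(r"^(?:" + alternation + r")$")

# ")" is not a word character, so \b cannot match after it at end of string.
print(with_boundaries.match("Engineering Works (EW)"))     # None
print(bool(anchors_only.match("Engineering Works (EW)")))  # True
print(bool(with_boundaries.match("Civil Services")))       # True

# Searching inside a longer string: look ahead for a non-word char or the end.
in_text = re.compile(r"\b(?:" + alternation + r")(?=\W|$)")
found = in_text.search("Dept: Engineering Works (EW), north depot")
print(found.group())  # Engineering Works (EW)
```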

How to match a specific word without spaces and without an additional letter in the starting or ending?

Let's say I have the word phone.
Its possible matches in my case are as follows:
phone (no space at the beginning or the end, just phone)
"phone" (can have special characters at the end or at the beginning)
Cases to be neglected [here I'll mark the space with \s]:
phone\s (a space at either the beginning or the end should not be matched)
phoneno (any letters or numbers appended to phone should not be matched)
I've tried the following regex: [^\w\s]items[^\w\s]
But it didn't match the case of phone with no space at the beginning and the end, as it requires one character other than a space or a word character at both the beginning and the end.
Kindly suggest any other solution that satisfies the above-mentioned cases.
You may use custom word boundaries, a combination of \b and (?<!\S) / (?!\S):
(?<![\w\s])phone(?![\w\s])
The (?<![\w\s]) negative lookbehind pattern matches a location in string that is NOT immediately preceded with a word or whitespace char.
The (?![\w\s]) negative lookahead pattern matches a location in string that is NOT immediately followed by a word or whitespace char.
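A quick check of that pattern against the cases from the question, sketched in Python (the test strings are my reading of the listed cases):

```python
import re

custom_boundary = re.compile(r"(?<![\w\s])phone(?![\w\s])")

# Cases from the question: the first two should match, the rest should not.
cases = ["phone", '"phone"', "phone ", " phone", "phoneno"]
results = {case: bool(custom_boundary.search(case)) for case in cases}
for case, matched in results.items():
    print(repr(case), matched)
# 'phone' True
# '"phone"' True
# 'phone ' False
# ' phone' False
# 'phoneno' False
```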

Difference between \w and \b regular expression meta characters

Can anyone explain the difference between \b and \w regular expression metacharacters? It is my understanding that both these metacharacters are used for word boundaries. Apart from this, which meta character is efficient for multilingual content?
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w. Flavors showing "ascii" for word boundaries in the flavor comparison recognize only these as word characters.
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
\W is short for [^\w], the negated version of \w.
\w matches a word character. \b is a zero-width match that matches a position that has a word character on one side, and something that's not a word character on the other. (Examples of things that aren't word characters include whitespace, beginning and end of the string, etc.)
\w matches a, b, c, d, e, and f in "abc def"
\b matches the (zero-width) position before a, after c, before d, and after f in "abc def"
See: http://www.regular-expressions.info/reference.html/
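The "abc def" example can be reproduced with a few lines of Python, shown here as an illustration only:

```python
import re

text = "abc def"

# \w consumes one word character per match.
word_chars = re.findall(r"\w", text)
print(word_chars)  # ['a', 'b', 'c', 'd', 'e', 'f']

# \b consumes nothing; it only reports positions.
boundary_positions = [m.start() for m in re.finditer(r"\b", text)]
print(boundary_positions)  # [0, 3, 4, 7]

# Every \b match is the empty string.
all_empty = all(m.group() == "" for m in re.finditer(r"\b", text))
print(all_empty)  # True
```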
@Mahender, you probably meant the difference between \W (instead of \w) and \b. If not, then I would agree with @BoltClock and @jwismar above. Otherwise continue reading.
\W would match any non-word character, and so it's easy to try to use it to match word boundaries. The problem is that it will not match the start or end of a line. \b is more suited for matching word boundaries as it will also match the start or end of a line. Roughly speaking (more experienced users can correct me here) \b can be thought of as (\W|^|$). [Edit: as @Ωmega mentions below, \b is a zero-length match so (\W|^|$) is not strictly correct, but hopefully helps explain the diff]
Quick example: for the string Hello World, .+\W matches "Hello " (including the trailing space) but cannot extend to World, because no non-word character follows it. .+\b matches the whole of Hello World, because \b also matches at the end of the string.
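To see the difference concretely, here is a small Python sketch on the same string (the extra World\W and World\b patterns are added by me for illustration):

```python
import re

text = "Hello World"

# Greedy .+ backtracks until \W can consume a character: the space.
greedy_nonword = re.search(r".+\W", text).group()
print(repr(greedy_nonword))  # 'Hello '

# \b is zero-width and also matches at the very end of the string.
greedy_boundary = re.search(r".+\b", text).group()
print(repr(greedy_boundary))  # 'Hello World'

# \W needs an actual character, so it fails at the end of the string.
world_nonword = re.search(r"World\W", text)
world_boundary = re.search(r"World\b", text)
print(world_nonword)                # None
print(world_boundary is not None)   # True
```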
\b <= this is a word boundary.
Matches at a position that is followed by a word character but not preceded by a word character, or that is preceded by a word character but not followed by a word character.
\w <= stands for "word character".
It always matches the ASCII characters [A-Za-z0-9_]
Is there anything specific you are trying to match?
Some useful regex websites for beginners or just to whet your appetite.
http://www.regular-expressions.info
http://www.javascriptkit.com/javatutors/redev2.shtml
http://www.virtuosimedia.com/dev/php/37-tested-php-perl-and-javascript-regular-expressions
http://www.i-programmer.info/programming/javascript/4862-master-javascript-regular-expressions.html
I found this to be a very useful book:
Mastering Regular Expressions by Jeffrey E.F. Friedl
\w is not a word boundary, it matches any word character, including underscores: [a-zA-Z0-9_]. \b is a word boundary, that is, it matches the position between a word and a non-alphanumeric character: \W or [^\w].
These implementations may vary from language to language though.
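As one example of that variation, and as a partial answer to the multilingual part of the question: in Python 3, \w and \b are Unicode-aware for str patterns by default and fall back to [a-zA-Z0-9_] only with the re.ASCII flag. Other languages behave differently, so treat this as a sketch for Python only:

```python
import re

# "naïve café", written with escapes to avoid encoding surprises.
text = "na\u00efve caf\u00e9"

# Default for str patterns: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)  # ['naïve', 'café']

# With re.ASCII, only [a-zA-Z0-9_] are word characters.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(ascii_words)  # ['na', 've', 'caf']
```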

Regex conditional lookbehind character match

So I want to find the string "to" in a string, but only when it is standalone. It could be at the beginning of the string, as in "to do this", so I can't search " to ".
What I want to do is say, if there is a character behind "to", it cannot be \w. How do I do that?
Try word boundaries. \b matches at the beginning and the end of the searched word:
\bto\b
This is exactly what you want to say, i.e.:
So what exactly is it that \b matches? Regular expression engines do not understand English, or any language for that matter, and so they don't know what word boundaries are. \b simply matches a location between characters that are usually parts of words (alphanumeric characters and underscore, text that would be matched by \w) and anything else (text that would be matched by \W).
Sams Teach Yourself Regular Expressions in 10 Minutes
By Ben Forta
Try using \bto\b, which will match to as a stand-alone word
Here's a good explanation:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
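For what it's worth, here is a short Python sketch of \bto\b on a made-up sentence, next to the lookaround form that spells out the "no \w before or after" condition from the question:

```python
import re

text = "to do this today, go to town"

# Standalone "to" only: "today" and "town" are skipped.
with_b = [m.start() for m in re.finditer(r"\bto\b", text)]
print(with_b)  # [0, 21]

# Same idea written as explicit lookarounds: no word char before or after.
with_lookarounds = [m.start() for m in re.finditer(r"(?<!\w)to(?!\w)", text)]
print(with_lookarounds)  # [0, 21]
```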

Regex: Difference between negative lookbehind and negation

From regular-expressions.info:
\b\w+(?<!s)\b. This is definitely not the same as \b\w+[^s]\b. When applied to Jon's, the former will match Jon and the latter Jon' (including the apostrophe). I will leave it up to you to figure out why. (Hint: \b matches between the apostrophe and the s). The latter will also not match single-letter words like "a" or "I".
Can you explain why?
Also, can you make clear what exactly \b does, and why it matches between the apostrophe and the s?
\b is a zero-width assertion that means word boundary. These character positions (taken from that link) are considered word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Word characters are of course any \w. s is a word character, but ' is not. In the above example, the area between the ' and the s is a word boundary.
The string "Jon's" looks like this if I highlight the anchors and boundaries (the first and last \bs occur in the same positions as ^ and $): ^Jon\b'\bs$
The negative lookbehind assertion (?<!s)\b means it will only match a word boundary if it's not preceded by the letter s (i.e. the last word character is not an s). So it looks for a word boundary under a certain condition.
Therefore the first regex works like this:
\b\w+ matches the first three letters J o n.
There's actually another word boundary between n and ' as shown above, so (?<!s)\b matches this word boundary because it's preceded by an n, not an s.
Since the end of the pattern has been reached, the resultant match is Jon.
The complementary character class [^s]\b means it will match any character that is not the letter s, followed by a word boundary. Unlike the above, this looks for one character followed by a word boundary.
Therefore the second regex works like this:
\b\w+ matches the first three letters J o n.
Since the ' is not the letter s (it fulfills the character class [^s]), and it's followed by a word boundary (between ' and s), it's matched.
Since the end of the pattern has been reached, the resultant match is Jon'. The letter s is not matched because the word boundary before it has already been matched.
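The walkthrough above can be checked with a few lines of Python (the single-letter input "a" covers the other claim from the quoted text):

```python
import re

lookbehind = re.compile(r"\b\w+(?<!s)\b")
negated_class = re.compile(r"\b\w+[^s]\b")

jon_lookbehind = lookbehind.search("Jon's").group()
jon_negated = negated_class.search("Jon's").group()
print(jon_lookbehind)  # Jon
print(jon_negated)     # Jon'

# [^s] must consume a character of its own, so a one-letter word cannot match.
single_lookbehind = lookbehind.search("a").group()
single_negated = negated_class.search("a")
print(single_lookbehind)  # a
print(single_negated)     # None
```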
The example is trying to demonstrate that lookaheads and lookbehinds can be used to create "and" conditions.
\b\w+(?<!s)\b
could also be written as
\b\w*\w(?<!s)\b
That gives us
\b\w*[^s]\b vs \b\w*\w(?<!s)\b
I did that so we can ignore the irrelevant. (The \b are simply distractions in this example.) We have
[^s] vs \w(?<!s)
On the left, we can match any character except "s"
On the right, we can match any word character except "s"
By the way,
\w(?<!s)
could also be written
(?!s)\w # Not followed by "s" and followed by \w
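A small Python sketch of that comparison, using a made-up test string that contains an "s", an apostrophe, a letter and a punctuation mark:

```python
import re

text = "s't!"

# [^s]: any character except "s", including non-word characters.
any_but_s = re.findall(r"[^s]", text)
print(any_but_s)  # ["'", 't', '!']

# \w(?<!s): a word character that is not "s".
word_lookbehind = re.findall(r"\w(?<!s)", text)
print(word_lookbehind)  # ['t']

# (?!s)\w: the same condition, checked before consuming the character.
word_lookahead = re.findall(r"(?!s)\w", text)
print(word_lookahead)  # ['t']
```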