Regex - how to exclude single word? - regex

I am using http://www.position-absolute.com/articles/jquery-form-validator-because-form-validation-is-a-mess/ for validation. Validation rules are defined in the following way:
"onlyLetterSp": {
"regex": /^[a-zA-Z\ \']+$/,
"alertText": "* Only letters"
}
I would like to add a new rule that excludes one single word. I have read some similar questions on StackOverflow and tried to declare it with something like this:
"regex": /(?!exclude_word)\^[a-zA-Z\ \']+$/,
But it didn't work. Can you give me some advice on how to do it?

This is a good time to use word boundary assertions, as @FailedDev indicated, but care needs to be taken to avoid rejecting certain not-TOO-special cases, such as wordy or wordsmith, or even less obvious cases like sword or foreword.
I believe this will work pretty well:
\b(?!\bword\b)\w+\b
This is the expression broken down:
\b # assert at a word boundary
(?! # look ahead and assert that what follows IS NOT...
\b # a word boundary
word # followed by the exact characters `word`
\b # followed by a word boundary
) # end look-ahead assertion
\w+ # match one or more word characters: `[a-zA-Z0-9_]`
\b # then a word boundary
The expression in the original question, however, matches more than word characters. [a-zA-Z\ \']+ matches spaces (to support multiple words in the input) and single quotes as well (for apostrophes?). If you need to allow words with apostrophes in them then use the following expression:
\b(?!\bword\b)[a-zA-Z']+\b
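As a quick check of the breakdown above, here is a minimal Python sketch of the first expression. The helper name words_except and the sample sentence are made up for illustration; the pattern itself is the one from this answer.

```python
import re

# Pattern from the answer: any whole word except the exact word "word".
NOT_WORD = re.compile(r"\b(?!\bword\b)\w+\b")


def words_except(text):
    """Return every word in text except the standalone word 'word'."""
    return NOT_WORD.findall(text)


if __name__ == "__main__":
    # "word" alone is skipped; longer words containing it are kept.
    print(words_except("word wordy sword foreword words"))
```

Only the standalone word is dropped, which is the point of putting \b on both sides of it inside the look-ahead.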

\b(?:(?!word)\w)+\b
This will not match the word "word". Note that it also rejects any longer word containing "word", such as "sword" or "wordy".

It's unclear from your question what you want, but I've interpreted it as "not matching input that contains a particular word". The regex for this is:
^(?!.*\bexclude_word\b)
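One possible way to use this as a complete validation rule is to put the look-ahead in front of the original character class. The sketch below is in Python rather than the plugin's JavaScript, and build_rule, is_valid and the excluded word "admin" are hypothetical names chosen for the example. The pattern uses only look-ahead and \b, which JavaScript regular expressions also support.

```python
import re


def build_rule(excluded):
    """Build a whole-input rule: letters, spaces and apostrophes only,
    and the excluded word must not appear as a whole word."""
    # re.escape guards against regex metacharacters in the excluded word.
    return re.compile(r"^(?!.*\b" + re.escape(excluded) + r"\b)[a-zA-Z ']+$")


# "admin" is an arbitrary example word, not something from the question.
RULE = build_rule("admin")


def is_valid(value):
    """True if value passes the rule."""
    return RULE.match(value) is not None


if __name__ == "__main__":
    for sample in ["John Smith", "the admin user", "administrator"]:
        print(sample, is_valid(sample))
```

Because of the word boundaries, "administrator" is still accepted while "admin" on its own or inside a phrase is rejected.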

Related

RegExp: find "cleverness" in a string

My RegExpression:
((^|\s)(clever)($|\s))
It finds "clever" in the string:
clever or not
yahoo clever
but it doesn't find "clever" in this string:
what means cleverness
I don't want to bother you with the three other RegExp variations of my line above but I tried different approaches already but can't make it work.
I am filtering terms in a table to cluster them into defined groups. I am looking for the adjective "clever". I don't want to find strings where clever is part of another word, for example "MacLever" or "alcleveracio".
Try this :
((^|\s)(clever))
Your regex ends with ($|\s), which forces clever to be followed by whitespace or to be at the end of the string.
Try using ^(.*\W)?(clever)(\W.*)?$ instead. \W matches any non-word character, so this enforces that any text before "clever" ends with a non-word character (and, likewise, that any text after it starts with one).
You can plug it into https://regex101.com/ to see how it is working and test it out.
You can use the word boundary \b.
\bclever\w*\b
or maybe better (no capitals allowed)
\bclever[a-z]*\b
If "clever" should be either at the beginning or at the end:
\b([a-zA-Z]+)?clever(?(1)|[a-z]*)\b
\b a word boundary at the start of the word
([a-zA-Z]+) at least one letter, captured as group 1
? makes group 1 optional, so the match continues even if the group is empty
clever matches the literal characters
(?(1) starts a condition that depends on group 1
|[a-z]*) if group 1 matched, no further characters may follow; otherwise ( | ) any number of lowercase letters ( [a-z]* ) may follow
\b the final word boundary
Test and visualize: Debuggex Demo
Info about If-Then-Else
(visualized by Regulex)
Test it on regex101
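To compare the question's pattern with the simpler \b suggestion, here is a small Python sketch. The function name find_clever is invented for the example, and the sample strings are the ones mentioned in the question.

```python
import re

# Pattern from the question: requires whitespace or string end after "clever".
ORIGINAL = re.compile(r"(^|\s)(clever)($|\s)")

# Suggested pattern: "clever" at the start of a word, with any word tail.
PREFIX = re.compile(r"\bclever\w*\b")


def find_clever(text):
    """Return the first word starting with 'clever', or None."""
    match = PREFIX.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample in ["clever or not", "what means cleverness", "alcleveracio"]:
        print(sample, "->", find_clever(sample))
```

The original pattern misses "cleverness" because nothing after "clever" is allowed except whitespace or the end of the string, while the word-boundary version accepts a tail of word characters and still ignores "alcleveracio".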

Using regex to find abbreviations

I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am kind of new to RegEx and I am having difficulties creating an expression, though I believe it should be somewhat simple. The expression should pick up words that have two or more capitalised letters. The expression should also be able to pick up words where a dash has been used in between and report the whole word (both before and after the dash). If numbers are also present they should also be reported with the word.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.
However this does also pick up these wrong words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I want it to only give me words that have at least two capitalised letters. I understand that it will also "mistakenly" pick up words such as "Abc-Abc", but I don't believe there is a way to avoid these.
If a lookahead is supported and you don't want to match double -- you might use:
\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
Explanation
\b A word boundary
(?= Positive lookahead, assert that from the current location to the right is
(?:[a-z\d-]*[A-Z]){2} Match 2 times: optionally any of the allowed characters, followed by an uppercase char A-Z
) Close the lookahead
[A-Za-z\d]+ match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
\b A word boundary
See a regex101 demo.
To also not match when hyphens surround the characters, you can use negative lookarounds asserting that there is no hyphen to the left or right.
\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
See another regex demo.
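Since the question is about Python, both expressions can be tried with the re module as follows. This is only a sketch: the function names find_abbreviations and find_abbreviations_strict are made up, and the test string is the list of examples from the question.

```python
import re

# First pattern from the answer: at least two uppercase letters in the word.
ABBREV = re.compile(
    r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b"
)

# Second pattern: additionally refuse a hyphen directly before or after.
ABBREV_STRICT = re.compile(
    r"\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)"
)


def find_abbreviations(text):
    """Return candidate abbreviations using the first pattern."""
    return ABBREV.findall(text)


def find_abbreviations_strict(text):
    """Return candidates that are not wrapped in stray hyphens."""
    return ABBREV_STRICT.findall(text)


if __name__ == "__main__":
    sample = ("ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, "
              "ABC-123, 123-ABC, A-bc, a-b-c")
    print(find_abbreviations(sample))
```

A-bc and a-b-c are skipped because the look-ahead cannot find a second uppercase letter before the word ends.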

Exclude any word followed by # from regex

I'm trying to use the regex.replace function in VB.NET, and I want to exclude any word that has an # symbol after it. At the moment, the pattern I'm using is "/b" & Term & "/b" (where Term is whatever word I want to replace).
Thanks.
You may try this:
\b(?<!#)[^#\s]+(?!#)\b
Regex Demo
Explanation
[^#\s]+ This will exclude any word that has '#' within or just after it. A character class that starts with ^ negates everything inside it; thus ^ inside [] does not mean the start of a string.
In many flavors, the word boundary \b treats # as a boundary character. Therefore you need to make sure that a boundary next to # is not accepted, which is why the lookbehind and lookahead have been introduced here.
The first \b(?<!#) ensures a word boundary that is not preceded by #
The last (?!#)\b ensures a word boundary that is not followed by #
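The question is about VB.NET, but the behaviour of the pattern can be sketched in Python, whose re module supports the same look-ahead and fixed-width look-behind. The function name words_without_hash and the sample text are invented for illustration.

```python
import re

# Pattern from the answer: words that are not adjacent to a '#'.
NO_HASH = re.compile(r"\b(?<!#)[^#\s]+(?!#)\b")


def words_without_hash(text):
    """Return the words in text that have no '#' right before or after."""
    return NO_HASH.findall(text)


if __name__ == "__main__":
    # "bar#" is skipped because '#' follows it directly.
    print(words_without_hash("foo bar# baz"))
```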

Regex matching on word boundary OR non-digit

I'm trying to use a Regex pattern (in Java) to find a sequence of 3 digits and only 3 digits in a row. 4 digits doesn't match, 2 digits doesn't match.
The obvious pattern to me was:
"\b(\d{3})\b"
That matches against many source string cases, such as:
">123<"
" 123-"
"123"
But it won't match against a source string of "abc123def" because the c/1 boundary and the 3/d boundary don't count as a "word boundary" match that the \b class is expecting.
I would have expected the solution to be adding a character class that includes both non-Digit (\D) and the word boundary (\b). But that appears to be illegal syntax.
"[\b\D](\d{3})[\b\D]"
Does anybody know what I could use as an expression that would extract "123" for a source string situation like:
"abc123def"
I'd appreciate any help. And yes, I realize that in Java one must double-escape codes like \b to \\b, but that's not my issue and I didn't want to limit this to Java folks.
You should use lookarounds for those cases:
(?<!\d)(\d{3})(?!\d)
This means: match 3 digits that are neither preceded nor followed by a digit.
Working Demo
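The question was deliberately not limited to Java, so here is a minimal Python sketch of the lookaround pattern. The function name three_digit_runs is hypothetical; the sample strings come from the question.

```python
import re

# Exactly three digits: no digit immediately before or after.
THREE_DIGITS = re.compile(r"(?<!\d)(\d{3})(?!\d)")


def three_digit_runs(text):
    """Return every run of exactly three digits in text."""
    return THREE_DIGITS.findall(text)


if __name__ == "__main__":
    for sample in ["abc123def", ">123<", "1234", "12"]:
        print(repr(sample), three_digit_runs(sample))
```

Because the lookarounds consume nothing, neighbouring letters or punctuation never become part of the match.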
Lookarounds can solve this problem, but I personally try to avoid them because not all regex engines fully support them. Additionally, I wouldn't say this issue is complicated enough to merit the use of lookarounds in the first place.
You could match this: (?:\b|\D)(\d{3})(?:\b|\D)
Then return: \1
Or if you're performing a replacement and need to match the entire string: (?:\b|\D)+(\d{3})(?:\b|\D)+
Then replace with: \1
As a side note, the reason \b wasn't working as part of a character class was because within brackets, [\b] actually has a completely different meaning--it refers to a backspace, not a word boundary.
Here's a Working Demo.
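Both points from this answer can be checked with a short Python sketch: the alternation pattern with its capture group, and the fact that [\b] inside a character class means backspace (Python's re module follows the same convention). The names extract_three and BACKSPACE_CLASS are made up for the example.

```python
import re

# Alternation instead of lookarounds; the digits are in group 1.
ALTERNATION = re.compile(r"(?:\b|\D)(\d{3})(?:\b|\D)")

# Inside a character class, \b is the backspace character (0x08).
BACKSPACE_CLASS = re.compile(r"[\b]")


def extract_three(text):
    """Return the first run of exactly three digits, or None."""
    match = ALTERNATION.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_three("abc123def"))
    print(BACKSPACE_CLASS.search("abc\x08def") is not None)
```

Unlike the lookaround version, the whole match here can include a neighbouring non-digit character, which is why the result is read from group 1.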

Regex negation - word parsing

I am trying to parse a phrase and exclude common words.
For instance in the phrase "as the world turns", I want to exclude the common words "as" and "the" and return only "world" and "turns".
(\w+(?!the|as))
Doesn't work. Feedback appreciated.
The lookahead should come first:
(\b(?!(the|as)\b)\w+\b)
I have also added word boundaries to ensure that it only matches whole words. Otherwise it would fail to match the complete word "as", but it would successfully match the letter "s" of that word.
You might also want to consider what \w matches and if that meets your needs. If you are looking for words in English you probably are interested in letters but not digits and you may wish to include some punctuation characters that are excluded by \w, such as apostrophes. You could try something like this instead (Rubular):
/(\b(?!(?:the|as)\b)[a-z'-]+\b)/i
To match words more accurately in a human language you could consider using a natural language parsing library instead of regular expressions.
You should use word boundaries to only match whole words. Either with a look-ahead assertion:
(\b(?!(?:the|as)\b)\w+\b)
Or with a look-behind assertion:
(\b\w+\b(?<!\b(?:the|as)))
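The look-ahead form can be tried in Python like this. It is a sketch with a made-up function name, uncommon_words, and it drops the outer capture group so that findall returns whole words. The look-behind form is not used here because Python's built-in re module only accepts fixed-width look-behind, and the alternatives "the" and "as" differ in length.

```python
import re

# Look-ahead form from the answers, without the outer capture group.
UNCOMMON = re.compile(r"\b(?!(?:the|as)\b)\w+\b")


def uncommon_words(phrase):
    """Return the words of phrase, skipping the whole words 'the' and 'as'."""
    return UNCOMMON.findall(phrase)


if __name__ == "__main__":
    print(uncommon_words("as the world turns"))
    # Words that merely start with "the" or "as" are kept.
    print(uncommon_words("as the theory asks"))
```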