RegExp: find "cleverness" in a string - regex

My RegExpression:
((^|\s)(clever)($|\s))
It finds "clever" in the string:
clever or not
yahoo clever
but it doesn't find "clever" in this string:
what means cleverness
I won't bother you with the three other variations of the expression above; I have tried several approaches already but can't make it work.
I am filtering terms in a table to cluster them into defined groups. I am looking for the adjective "clever". I don't want to find strings where "clever" is part of another word, for example "MacLever" or "alcleveracio".

Try this :
((^|\s)(clever))
Your regex contains ($|\s), which forces "clever" to be followed by whitespace or by the end of the string.

Try using ^(.*\W)?(clever)(\W.*)?$ instead. \W matches any non-word character, so this enforces that any text before "clever" ends with a non-word character (and likewise that any text after it starts with one).
You can plug it into https://regex101.com/ to see how it is working and test it out.
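As a rough check of the pattern above, here is a minimal sketch using Python's re module (the engine is an assumption, since the question does not name one, and the helper name is made up for illustration). Note that this pattern accepts "clever" only as a standalone word, so "cleverness" is rejected.

```python
import re

# Pattern from the answer above: "clever" must be delimited by non-word
# characters (or by the string edges) on both sides.
WHOLE_WORD = re.compile(r"^(.*\W)?(clever)(\W.*)?$")

def has_standalone_clever(text: str) -> bool:
    """Hypothetical helper: True if 'clever' appears as its own word."""
    return WHOLE_WORD.match(text) is not None

for sample in ["clever or not", "yahoo clever",
               "what means cleverness", "alcleveracio"]:
    print(f"{sample!r}: {has_standalone_clever(sample)}")
```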

You can use the word boundary \b.
\bclever\w*\b
or maybe better (no capitals allowed)
\bclever[a-z]*\b
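A minimal sketch of the word-boundary pattern, assuming Python's re module (the helper name and sample strings are illustrative only). It returns the whole word that starts with "clever", which covers both "clever" and "cleverness" while skipping words that merely contain it.

```python
import re

# \b before "clever" rejects matches that start in the middle of a word;
# [a-z]* then extends the match over any lowercase suffix.
CLEVER_WORD = re.compile(r"\bclever[a-z]*\b")

def find_clever(text: str):
    """Hypothetical helper: return the matched word, or None."""
    match = CLEVER_WORD.search(text)
    return match.group(0) if match else None

for sample in ["clever or not", "what means cleverness",
               "alcleveracio", "MacLever"]:
    print(f"{sample!r}: {find_clever(sample)}")
```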
If "clever" should be either at the beginning or at the end:
\b([a-zA-Z]+)?clever(?(1)|[a-z]*)\b
\b a word boundary (the start of the word)
([a-zA-Z]+) at least one letter, captured as group 1
? makes the group optional, so the pattern also matches when the group is empty
clever matches the literal characters
(?(1) starts a condition that depends on group 1
|[a-z]*) if group 1 matched, nothing more may follow; otherwise ( | ) any number of lowercase letters may follow ( [a-z]* )
\b the final word boundary
Test and visualization: Debuggex Demo
Info about If-Then-Else
(visualized by Regulex)
Test it on regex101
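The conditional pattern can be tried out with a short sketch like the following, assuming Python's re module, which accepts the (?(1)...|...) syntax. The helper name and the sample words are made up for illustration.

```python
import re

# If group 1 (a prefix) matched, nothing may follow "clever";
# if there is no prefix, a lowercase suffix is allowed.
EDGE_CLEVER = re.compile(r"\b([a-zA-Z]+)?clever(?(1)|[a-z]*)\b")

def find_edge_clever(text: str):
    """Hypothetical helper: return the matched word, or None."""
    match = EDGE_CLEVER.search(text)
    return match.group(0) if match else None

for sample in ["cleverness", "unclever", "clever", "alcleveracio"]:
    print(f"{sample!r}: {find_edge_clever(sample)}")
```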

Related

Simplify this repeating regex

I have the following valid regex to match various excel cell/range patterns, of the form A1, A1:Z12, etc.
^(?:[A-Za-z]{1,3}\d{0,10})(?::(?:[A-Za-z]{1,3}\d{0,10}))?$
Is there a more compact way to write the second part of the match? Basically, for the : <repeat> part, I was hoping to be able to do something like:
^ (<main_part> ':'<lookahead, keep if before an A-Z> ){1,2} $
Any way to do that pattern?
A way without capture groups or lookarounds, use a word-boundary:
^(?:\b:?[A-Z]{1,3}[0-9]{1,10}){1,2}$
demo
The word-boundary can't succeed between the start of the string and a colon nor between a digit and a letter, but it does between a digit and a colon or between the start of the string and a letter.
For the same kind of reasons, it's also possible to put the word boundary at the end of the repeated group:
^(?:[A-Z]{1,3}[0-9]{1,10}:?\b){1,2}$
(You win one step more with this one, YAY!)
test cases (first pattern):
with :A2
It fails because \b fails between the start of the string and a non-word character (the colon).
with A2:
It fails because there's no colon at the end of the sub-pattern (that is not repeated in this case).
with A2:A2
The pattern succeeds. \b succeeds because the first time it is between the start of the string and a letter (a word character), and the second time because it is between a digit (a word character too) and a colon (a non-word character).
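The test cases above can be reproduced with a small sketch, assuming Python's re module (the helper name is hypothetical). It uses the first pattern, where the word boundary sits at the start of the repeated group.

```python
import re

# First pattern from the answer: \b must succeed before each repetition,
# which allows "start + letter" and "digit + colon" but not "start + colon".
RANGE = re.compile(r"^(?:\b:?[A-Z]{1,3}[0-9]{1,10}){1,2}$")

def is_cell_or_range(text: str) -> bool:
    """Hypothetical helper: True for inputs such as A1 or A1:Z12."""
    return RANGE.match(text) is not None

for sample in ["A1", "A1:Z12", ":A2", "A2:", "A2:A2", "A1B2"]:
    print(f"{sample!r}: {is_cell_or_range(sample)}")
```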
Here is an example pattern you can use. Note that AB:AB is not a valid range as described above, so the digit quantifier has been changed to {1,10} as well:
^(?:[A-Z]{1,3}[0-9]{1,10}(?::(?=[A-Z]))?){1,2}$
A better approach, in engines that support subroutine calls (such as PCRE), would be to use (?1) to recurse into the first group's pattern:
^([A-Z]{1,3}[0-9]{1,10})(:(?1))?$
Note however with this approach we do need the extraneous capturing group at the beginning for this technique to work.
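Python's built-in re module has no (?1) subroutine call, so this is not the technique above but a comparable workaround: build the pattern from a shared sub-pattern string so the cell part is written only once. The names CELL, RANGE and is_range are assumptions made for this sketch.

```python
import re

# The cell sub-pattern is defined once and reused for both halves.
CELL = r"[A-Z]{1,3}[0-9]{1,10}"
RANGE = re.compile(rf"^{CELL}(?::{CELL})?$")

def is_range(text: str) -> bool:
    """Hypothetical helper: True for a cell (A1) or a range (A1:Z12)."""
    return RANGE.match(text) is not None

for sample in ["A1", "A1:Z12", "AB:AB", "A1:"]:
    print(f"{sample!r}: {is_range(sample)}")
```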

Regex to match other than listed string

I need to select a value that is not in the following list and that contains no special characters.
List of strings and characters that need to be rejected:
XNIL
SNIL
All special characters
My expression is like this (?!XNIL|SNIL|[\W])\w+
The problem is that if my text contains the word XNIL or SNIL, the expression still allows the word NIL, even though I have listed XNIL and SNIL to be rejected. What mistake did I make here?
You can check my regex online here -> http://regexr.com/3cdsl
This seems to work on your test page: (?!(XNIL|SNIL|\W+))\b\w+ At least it solves the XNIL/SNIL problem.
The reason your regex was matching inside XNIL is that the lookahead only rejects the position at X; \w+ is then free to start at the next character and match NIL. To see why, take your original and change \w+ to \w and notice the difference.
UPDATE:
Based on your feedback, you also wish to exclude _.
Because _ is used in programming language symbols, and [arguably] regexes were created, of, by, and for programmers, _ is considered a "word" char (i.e. it's in \w and therefore not excluded by \W).
From the [perl] regex man page:
\w Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks)
Your final regex might need to be: (?!(XNIL|SNIL|_+|\W+))\b\w+. (Note: the _+)
A cleaner way: (?!(XNIL|SNIL|[\W_]+))\b\w+ which produces the same results yet is closer in intent to what you wanted.
You may have to adjust \w+ accordingly as well
If you really want to be sure, at the expense of being slightly more verbose, write out the character class as you choose:
(?!(XNIL|SNIL|[^a-zA-Z0-9]+))\b[a-zA-Z0-9]+
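To make the difference visible, here is a small sketch assuming Python's re module; the sample text and the helper name are made up. It compares the original expression with the \b version that also excludes underscores.

```python
import re

ORIGINAL = re.compile(r"(?!XNIL|SNIL|[\W])\w+")
FIXED = re.compile(r"(?!(XNIL|SNIL|[\W_]+))\b\w+")

def all_matches(pattern, text):
    """Hypothetical helper: list the full text of every match."""
    return [m.group(0) for m in pattern.finditer(text)]

TEXT = "XNIL SNIL abc ___ x1"

# Without \b the match can restart one character into a rejected word.
print(all_matches(ORIGINAL, TEXT))
# With \b a match may only begin at the start of a word.
print(all_matches(FIXED, TEXT))
```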
Check this regex
[^(XNIL|SNIL|[^\w])]
Explanation
[] with ^ at the beginning says that anything that is not in the list given in [] should be matched.
(XNIL|SNIL|[^\w]) matches the words XNIL or SNIL, or [^\w] matches anything other than word characters (i.e. special chars)
So the whole regex matches anything that is not in [^(XNIL|SNIL|[^\w])]
This should work
(?m)^(((?!XNIL|SNIL|[\W]).)*)$
Grouping the character match with the negative lookahead will cause the zero length assertion to continue until finished (in this case at the end of the string due to $)
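A minimal sketch of this whole-line approach, assuming Python's re module and single-line inputs (the helper name is hypothetical). Because the lookahead is checked before every character, a rejected word anywhere in the line makes the whole line fail.

```python
import re

# (?m) lets ^ and $ anchor at line boundaries; the lookahead is
# re-checked before each character consumed by the dot.
LINE_OK = re.compile(r"(?m)^(((?!XNIL|SNIL|[\W]).)*)$")

def is_allowed(line: str) -> bool:
    """Hypothetical helper: True if the line contains no rejected item."""
    return LINE_OK.search(line) is not None

for sample in ["NIL", "XNIL", "abcSNIL", "a b", "value1"]:
    print(f"{sample!r}: {is_allowed(sample)}")
```

Note that an empty line also passes, since the repeated group may match zero characters.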

Regex matching on word boundary OR non-digit

I'm trying to use a Regex pattern (in Java) to find a sequence of 3 digits and only 3 digits in a row. 4 digits doesn't match, 2 digits doesn't match.
The obvious pattern to me was:
"\b(\d{3})\b"
That matches against many source string cases, such as:
">123<"
" 123-"
"123"
But it won't match against a source string of "abc123def" because the c/1 boundary and the 3/d boundary don't count as a "word boundary" match that the \b class is expecting.
I would have expected the solution to be adding a character class that includes both non-Digit (\D) and the word boundary (\b). But that appears to be illegal syntax.
"[\b\D](\d{3})[\b\D]"
Does anybody know what I could use as an expression that would extract "123" for a source string situation like:
"abc123def"
I'd appreciate any help. And yes, I realize that in Java one must double-escape codes like \b as \\b, but that's not my issue and I didn't want to limit this to Java folks.
You should use lookarounds for those cases:
(?<!\d)(\d{3})(?!\d)
This means: match 3 digits that are neither preceded nor followed by a digit.
Working Demo
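The question is about Java, but the lookaround pattern is the same in most engines. As a quick illustration, here is a sketch assuming Python's re module, with made-up sample strings and a hypothetical helper name.

```python
import re

# Exactly three digits: no digit immediately before or after.
THREE_DIGITS = re.compile(r"(?<!\d)(\d{3})(?!\d)")

def three_digit_runs(text: str):
    """Hypothetical helper: list every run of exactly three digits."""
    return THREE_DIGITS.findall(text)

for sample in ["abc123def", ">123<", "1234", "12", "a123b4567c890"]:
    print(f"{sample!r}: {three_digit_runs(sample)}")
```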
Lookarounds can solve this problem, but I personally try to avoid them because not all regex engines fully support them. Additionally, I wouldn't say this issue is complicated enough to merit the use of lookarounds in the first place.
You could match this: (?:\b|\D)(\d{3})(?:\b|\D)
Then return: \1
Or if you're performing a replacement and need to match the entire string: (?:\b|\D)+(\d{3})(?:\b|\D)+
Then replace with: \1
As a side note, the reason \b wasn't working as part of a character class was because within brackets, [\b] actually has a completely different meaning--it refers to a backspace, not a word boundary.
Here's a Working Demo.
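Both points in this answer can be checked with a short sketch, assuming Python's re module, where [\b] inside a character class is likewise a backspace. The helper name and samples are illustrative only.

```python
import re

# Alternation outside a character class: either a word boundary
# (zero-width) or a single non-digit character on each side.
ALTERNATION = re.compile(r"(?:\b|\D)(\d{3})(?:\b|\D)")

def first_three_digits(text: str):
    """Hypothetical helper: return group 1 of the first match, or None."""
    match = ALTERNATION.search(text)
    return match.group(1) if match else None

for sample in ["abc123def", " 123-", "1234", "a1234"]:
    print(f"{sample!r}: {first_three_digits(sample)}")

# Inside brackets, \b is the backspace character (0x08), not a boundary.
print(re.fullmatch(r"[\b]", "\x08") is not None)
```

Unlike the lookaround version, the \D alternatives consume the neighbouring characters, which matters when two candidate runs are separated by a single non-digit.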

Regex to match number in #define statement

I have a line like this:
#define PROG_HWNR "36084"
or this:
#define PROG_HWNR "#37595"
I'd like to extract the number (and increase it, but that's not the matter here)
I wrote a regex, but it's not working (at least in http://gskinner.com/RegExr/ )
(?<="#?)(.*?)(?=")
I also tried variations like
(?<=("#?))(.*?)(?=")
or
(?<=("|"#)))(.*?)(?=")
But no success. The problem is, that I want to match only the number, no matter if there is a # or not ...
Can you point me in the right direction? Thanks!!
Try this regex:
"#?(\d+)"$
It will match:
" a quote
#? optional hash
( (start capturing)
\d+ one or more digits
) (stop capturing)
" a quote
$ anchor to end
Here is a JSFiddle, and here is a RegExr
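Here is how the capturing approach might look in practice. This is a sketch assuming Python's re module; the question does not say which language the extraction runs in, and the helper name is made up.

```python
import re

# Quote, optional hash, captured digits, closing quote at end of line.
HWNR = re.compile(r'"#?(\d+)"$')

def extract_number(line: str):
    """Hypothetical helper: return the digits as a string, or None."""
    match = HWNR.search(line)
    return match.group(1) if match else None

print(extract_number('#define PROG_HWNR "36084"'))
print(extract_number('#define PROG_HWNR "#37595"'))
```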
The problem is the variable length of the lookbehind. Only a few regex engines can deal with this. Because there are only two possible lookbehinds (with or without the #), you can expand them into two separate lookbehinds:
(?:(?<="#)|(?<=")).*?(?=")
Note that you don't need to capture the .*? if you use lookarounds, as they are excluded from the match anyway. Also, a better way than using non-greedy .*? is to use a greedy expression that can never go past the ending delimiter:
(?:(?<="#)|(?<="))[^"]*(?=")
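Python's re module is one of the engines that insist on fixed-width lookbehinds, so it makes a convenient test bed (using it here is an assumption; the question used RegExr). The sketch below shows the original pattern being rejected and the split version compiling. One thing to watch in this sketch: for the input with a hash, the leftmost match starts right after the quote, so the split lookbehind still returns the # along with the digits.

```python
import re

# The variable-length lookbehind from the question is rejected outright.
try:
    re.compile(r'(?<="#?)(.*?)(?=")')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)

# Two fixed-width lookbehinds compile fine.
SPLIT = re.compile(r'(?:(?<="#)|(?<="))[^"]*(?=")')

def quoted_value(line: str):
    """Hypothetical helper: return the first match, or None."""
    match = SPLIT.search(line)
    return match.group(0) if match else None

print(quoted_value('#define PROG_HWNR "36084"'))
print(quoted_value('#define PROG_HWNR "#37595"'))
```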
Alternatively (if you can access captured submatches), you can use a capturing approach and get rid of the lookarounds:
"#?([^"]*)"
Try this:
^#define \w+ "#?(\d+)"$
That will match the whole line, with the first/single group being the number you are looking for.
This is actually pretty basic regex functionality: match an optional character (?) and match a group of characters (the parentheses).
You can even go one simpler:
\d+
will match a string of digits. Only the digits. And ignore the rest of the input string.
Use this tool for testing this stuff, I found it pretty handy: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx

Regex - how to exclude single word?

I am using http://www.position-absolute.com/articles/jquery-form-validator-because-form-validation-is-a-mess/ for validation. Validation rules are defined in a following way:
"onlyLetterSp": {
"regex": /^[a-zA-Z\ \']+$/,
"alertText": "* Only letters"
}
I would like to add new rule, which will exclude one single word. I have read some similar questions on StackOverflow and tried to declare it with something like this
"regex": /(?!exclude_word)\^[a-zA-Z\ \']+$/,
But it didn't work. Can you give me some advice on how to do it?
This is a good time to use word boundary assertions, as @FailedDev indicated, but care needs to be exercised to avoid rejecting certain not-TOO-special cases, such as wordy, wordsmith, or even less obvious cases like sword or foreword.
I believe this will work pretty well:
\b(?!\bword\b)\w+\b
This is the expression broken down:
\b # assert at a word boundary
(?! # look ahead and assert that what follows IS NOT...
\b # a word boundary
word # followed by the exact characters `word`
\b # followed by a word boundary
) # end look-ahead assertion
\w+ # match one or more word characters: `[a-zA-Z0-9_]`
\b # then a word boundary
The expression in the original question, however, matches more than word characters. [a-zA-Z\ \']+ matches spaces (to support multiple words in the input) and single quotes as well (for apostrophes?). If you need to allow words with apostrophes in them then use the following expression:
\b(?!\bword\b)[a-zA-Z']+\b
\b(?:(?!word)\w)+\b
This will not match "word", nor any word that contains it.
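The two expressions behave differently on words that merely contain "word". A small comparison, assuming Python's re module rather than the JavaScript validator from the question, and using a made-up sample sentence:

```python
import re

# Rejects only the exact word "word".
EXACT = re.compile(r"\b(?!\bword\b)\w+\b")
# Checks the lookahead before every character, so any word that
# contains "word" is rejected as well.
TEMPERED = re.compile(r"\b(?:(?!word)\w)+\b")

TEXT = "word wordy sword foreword plain"

print(EXACT.findall(TEXT))
print(TEMPERED.findall(TEXT))
```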
It's unclear from your question what you want, but I've interpreted it as "not matching input that contains a particular word". The regex for this is:
^(?!.*\bexclude_word\b)
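Combined with the character class from the original rule, the lookahead can reject input that contains the word anywhere. The sketch below assumes Python's re module instead of the jQuery validator, and uses "forbidden" as a stand-in for the excluded word; both the word and the helper name are hypothetical.

```python
import re

# The lookahead scans the whole input once from the start; the rest of
# the pattern is the original "only letters" rule.
RULE = re.compile(r"^(?!.*\bforbidden\b)[a-zA-Z ']+$")

def is_valid(value: str) -> bool:
    """Hypothetical helper: letters, spaces and apostrophes only,
    and the stand-in word must not appear as a whole word."""
    return RULE.match(value) is not None

for sample in ["some name", "a forbidden name", "forbiddenly fine",
               "O'Brien", "name1"]:
    print(f"{sample!r}: {is_valid(sample)}")
```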