Regex for blacklist and whitelist words - regex

I'm trying to set up regex for a blacklist and whitelist, flagging blacklisted words and ignoring whitelisted words. Here are the rules:
I want to see if a word or phrase on the blacklist exists in the input string.
The blacklist words should be matched regardless of where they appear (full word or as substring).
The whitelist words (i.e. words that are known to be okay even though they contain blacklisted words) must not be matched, but only when they appear as full words.
Blacklist words I want to search for and match if found: BUNNY, GARDEN, HOLE
Whitelist words that are clean and can be ignored even though they contain blacklisted words: WHOLE, GARDENER
I made the following regex using negative lookbehind:
(BUNNY|GARDEN|HOLE)(?<!\bWHOLE\b|\bGARDENER\b)
My silly example string:
This whole hole is a wholey mistake in the gardener agardener.
I would expect only the following be matched:
"hole"
"wholey"
"agardener"
It mostly works, since "whole" doesn't match but "wholey" does and "agardener" is also a match. However, "gardener" matches even though it's in the whitelist. What am I missing?

You can use
\w*(?:BUNNY|GARDEN|HOLE)\w*\b(?<!\bWHOLE|\bGARDENER)
See the regex demo.
A variation without a lookbehind, but with a lookahead:
\b(?!(?:WHOLE|GARDENER)\b)\w*(?:BUNNY|GARDEN|HOLE)\w*\b
See this regex demo.
Details:
\w* - zero or more word chars
(?:BUNNY|GARDEN|HOLE) - one of the required word parts
\w* - zero or more word chars
\b - a word boundary
(?<!\bWHOLE|\bGARDENER) - a negative lookbehind that fails the match if the whole word immediately to the left is WHOLE or GARDENER.
The \b(?!(?:WHOLE|GARDENER)\b)\w*(?:BUNNY|GARDEN|HOLE)\w*\b pattern matches a word boundary first, then fails the match if the next chars form the whole word WHOLE or GARDENER, and then matches a word with a BUNNY, GARDEN or HOLE substring in it.
Replace \w with [a-zA-Z] or \p{L} (or [[:alpha:]]) if supported and you need to only match letter words.
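The lookahead variant can be checked with a short Python sketch. It assumes case-insensitive matching, since the sample sentence is lowercase while the pattern is uppercase. Only the lookahead form is used because Python's built-in re module rejects lookbehinds whose alternatives differ in width, as \bWHOLE and \bGARDENER do. The names BLACKLIST_RE, sample and flagged are illustrative.

```python
import re

# Lookahead variant from the answer. IGNORECASE is an assumption made here
# because the sample sentence is lowercase.
BLACKLIST_RE = re.compile(
    r"\b(?!(?:WHOLE|GARDENER)\b)\w*(?:BUNNY|GARDEN|HOLE)\w*\b",
    re.IGNORECASE,
)

sample = "This whole hole is a wholey mistake in the gardener agardener."

# No capturing groups are used, so findall returns the full matched words.
flagged = BLACKLIST_RE.findall(sample)
print(flagged)  # ['hole', 'wholey', 'agardener']
```

The whitelisted whole words "whole" and "gardener" are skipped, while "wholey" and "agardener" are still flagged because the whitelist entry is not the entire word there.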

Related

Using regex to find abbreviations

I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am kind of new to regex and am having difficulty creating an expression, though I believe it should be fairly simple. The expression should pick up words that have two or more capitalised letters. It should also pick up words where a dash has been used in between and report the whole word (both before and after the dash). If numbers are present, they should also be reported with the word.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.
However this does also pick up these wrong words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I want it to only give me words that have at least two capitalised letters. I understand that it will also "mistakenly" take words such as "Abc-Abc", but I don't believe there is a way to avoid these.
If a lookahead is supported and you don't want to match double -- you might use:
\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
Explanation
\b A word boundary
(?= Positive lookahead, assert that from the current location to the right is
(?:[a-z\d-]*[A-Z]){2} Match 2 times: zero or more of the allowed characters followed by an uppercase char A-Z
) Close the lookahead
[A-Za-z\d]+ match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
\b A word boundary
See a regex101 demo.
To also avoid matching when hyphens surround the characters, you can use negative lookarounds asserting that there is no hyphen to the left or right.
\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
See another regex demo.
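As a rough check in Python (the language named in the question), both patterns can be run against the sample words. The variable names and the "-ABC-" test string are made up for this sketch.

```python
import re

# Shared core of both patterns: the lookahead requires two uppercase letters,
# then hyphen-separated runs of letters and digits are consumed.
CORE = r"(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*"

abbrev_re = re.compile(r"\b" + CORE + r"\b")
# Stricter version that also refuses a hyphen directly before or after.
strict_re = re.compile(r"\b(?<!-)" + CORE + r"\b(?!-)")

text = "ABC AbC ABc A-ABC a-ABC ABC-a ABC123 ABC-123 123-ABC A-bc a-b-c"
found = abbrev_re.findall(text)
print(found)

# Hypothetical input with surrounding hyphens, to show the difference.
print(abbrev_re.findall("-ABC-"))  # ['ABC']
print(strict_re.findall("-ABC-"))  # []
```

The first print lists the nine wanted words and leaves out A-bc and a-b-c, since neither contains two uppercase letters.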

I need a regex to search in visual studio code for words starting with a specific string followed by letters that vary but exclude 3 specific cases

I need to search within my project, all words starting with the string "use" followed by any uppercase letter and other than these three cases:
useRef
useEffect
useState
Valid searches would be:
useExample
useTest
useWhatever
And invalid searches:
usefoo
usebar
In addition to the 3 strings mentioned above.
This is as far as I've managed to go, but in vscode it seems to have a different behavior than any regular expression checker and I don't really know where to go from here:
^(?!useRef)(use.*)
You can use the following regex:
\buse(?!(?:Ref|Effect|State)\b)[A-Z][a-zA-Z]*\b
See this regex demo.
Pattern details:
\b - a word boundary
use - a use string
(?!(?:Ref|Effect|State)\b) - a negative lookahead that fails the match if there is a Ref, Effect or State substring followed by a word boundary immediately to the right of the current location
[A-Z] - an uppercase ASCII letter
[a-zA-Z]* - any zero or more ASCII letters
\b - a word boundary.
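The pattern can be tried outside the editor with a small Python sketch. The names hook_re, names and custom_hooks are illustrative, and Python's re is only a stand-in for VS Code's search engine here. In the VS Code search box, the Match Case option presumably needs to be enabled for [A-Z] to stay case-sensitive.

```python
import re

hook_re = re.compile(r"\buse(?!(?:Ref|Effect|State)\b)[A-Z][a-zA-Z]*\b")

names = "useRef useEffect useState useExample useTest useWhatever usefoo usebar"
custom_hooks = hook_re.findall(names)
print(custom_hooks)  # ['useExample', 'useTest', 'useWhatever']

# The \b inside the lookahead excludes only the exact three names, so a
# longer name that merely starts with one of them is still found.
print(hook_re.findall("useStateful"))  # ['useStateful']
```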

Eclipse & regular expression that matches word X, excluding some longer words, which include X

As an example: I would like to use Eclipse's File Search to count occurrences of be (case insensitive), but not count occurrences of believed, babel, wannabe and become. Let's say that we have the following example "code":
// Belfast is believed to become a part of the world where
// people use word "be" most often; wannabe, babel?
I would like Eclipse to report that the above part of the "code" contains 2 matches (in Belfast and "be"). To sum up, I am looking for a regex which:
matches all words containing be (case insensitive),
and simultaneously:
does not match the exact word become
does not match the exact word babel
does not match the exact word believed
does not match the exact word wannabe
Could you tell me, how can I reach that?
EDIT:
I have edited the question body because the example I provided previously didn't completely match the question's title. Moreover, I have added a bulleted list with explicit rules.
Try something like this: (?i)\b(\w*be(?!lieved|come)\w*)\b
Example: https://regex101.com/r/79VEzr/1
Explanation:
(?i) - flag to enable case insensitivity
\b - Match a word boundary (on both ends of expression to match an individual word)
(\w*be(?!lieved|come)\w*) - Capture the word
\w* - Match any sequence of word characters
be - Match be literally
(?!lieved|come) - Negative lookahead to ensure that be isn't followed by lieved or come (removes believed and become from results)
\w* - Match more word characters after be
\b - Match ending word boundary
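The pattern above filters out believed and become, but babel and wannabe still contain be and are not covered by the lookahead. A possible variation, which is not part of the answer and is only a sketch, moves all four words into a whole-word negative lookahead at the start of the word. It is shown here in Python for convenience; the names answer_re and variant_re are made up, and behaviour in Eclipse's File Search is assumed rather than tested.

```python
import re

code = (
    '// Belfast is believed to become a part of the world where\n'
    '// people use word "be" most often; wannabe, babel?'
)

# Pattern from the answer (the inline (?i) is replaced by the IGNORECASE flag).
answer_re = re.compile(r"\b(\w*be(?!lieved|come)\w*)\b", re.IGNORECASE)

# Assumed variation: reject the four exact words up front, then match any
# word that contains "be".
variant_re = re.compile(
    r"\b(?!(?:become|babel|believed|wannabe)\b)\w*be\w*\b",
    re.IGNORECASE,
)

from_answer = answer_re.findall(code)
from_variant = variant_re.findall(code)
print(from_answer)   # ['Belfast', 'be', 'wannabe', 'babel']
print(from_variant)  # ['Belfast', 'be']
```

Because the exclusion list is anchored with \b on both sides, longer words such as "becomes" would still be counted by the variation.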

Ignore one word with regex

I know there are several similar questions already asked, but I can't fix this issue with regex.
I have a sentence like
Lorem Ipsum is http://stack.com text of the http://stack.com/wp-admin
printing and typesetting industry.
I want to catch the word "stack.com" but not stack.com/wp-admin
I have tried a few regexes but they are not working.
^(?!stack.com$).*
The ^(?!stack.com$).* regex matches any string (even an empty one) that is not exactly stack.com (with the unescaped . standing for any character).
To match stack.com but not inside stack.com/wp-admin, you need a negative lookahead:
/stack\.com(?!\/wp-admin)/
           ^^^^^^^^^^^^^^
Or better, with word boundaries to only match whole words:
/\bstack\.com\b(?!\/wp-admin)/
See the regex demo
Details:
\b - a leading word boundary
stack\.com - a literal string stack.com (a dot must be escaped)
\b - a trailing word boundary
(?!\/wp-admin) - a negative lookahead that fails the match if there is /wp-admin immediately to the right of the current location.
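A minimal Python sketch of the same idea follows. The answer uses JavaScript-style /.../ delimiters, so the slash is escaped there; in Python no delimiters are needed and / can be written unescaped. The names domain_re, text and starts are illustrative.

```python
import re

text = (
    "Lorem Ipsum is http://stack.com text of the "
    "http://stack.com/wp-admin printing and typesetting industry."
)

# Match stack.com as a whole word unless /wp-admin follows immediately.
domain_re = re.compile(r"\bstack\.com\b(?!/wp-admin)")

starts = [m.start() for m in domain_re.finditer(text)]
print(starts)  # a single offset: the first stack.com only
```

Only the first occurrence is reported, because the second one is directly followed by /wp-admin and the lookahead rejects it.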

How to find words that contain string with a limited size

I need to find all the words in an inputted text that have (?i:val) in them and are no longer than 5 characters.
So far I got: \b([a-zA-Z]*(?i:val)[a-zA-Z]*){1,4}\b
If we take this sample text to look in: In computer science, a value is an expression which cannot be evaluated any further (a normal form). Val is also a match
I get 3 matches (value, evaluated and Val), however evaluated should not match the pattern, as it is too long. What is the right way to get this straight?
Your pattern does not account for the length of the words matched.
Use word boundaries and a lookahead like this:
(?i)\b(?=\w*val)\w{1,5}\b
See regex demo
The regex matches:
\b - a leading word boundary since the next pattern is \w
(?=\w*val) - a lookahead making sure there is a val substring after zero or more word characters
\w{1,5} - matches 1 to 5 word characters
\b - trailing word boundary that stops words of more than 5 characters long from matching
You may use an ASCII JS version of the regex:
/\b(?=[a-z]*val)[a-z]{1,5}\b/i
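For a quick check against the sample text from the question, the same pattern can be run in Python. The inline (?i) is expressed as the IGNORECASE flag, and the variable names are made up for this sketch.

```python
import re

text = (
    "In computer science, a value is an expression which cannot be "
    "evaluated any further (a normal form). Val is also a match"
)

# The lookahead checks for "val" somewhere in the word; \w{1,5}\b then
# caps the word length at five characters.
short_val_re = re.compile(r"\b(?=\w*val)\w{1,5}\b", re.IGNORECASE)

short_words = short_val_re.findall(text)
print(short_words)  # ['value', 'Val']
```

"evaluated" passes the lookahead but fails the length check: after five word characters there is no word boundary, so the match is abandoned.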
It's important to understand why the "evaluated" was matched. Note:
[a-zA-Z]* matches the "e"
(?i:val) matches "val"
[a-zA-Z]* matches "uated"
Actually, there's no repetition here! The pattern was matched in only one iteration.
You can achieve what you want using lookarounds, but I think that regex is not the best tool for this task. I highly recommend using other functions, depending on what you have available.
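The answer does not say which functions it has in mind. One possible non-regex approach, sketched here in Python purely as an assumption, is to split the text into words and filter them by length and substring. The names letters_only and picked are illustrative.

```python
text = (
    "In computer science, a value is an expression which cannot be "
    "evaluated any further (a normal form). Val is also a match"
)

# Replace every non-letter with a space so that split() yields plain words.
letters_only = "".join(ch if ch.isalpha() else " " for ch in text)

# Keep words of at most 5 characters that contain "val" in any letter case.
picked = [
    word
    for word in letters_only.split()
    if len(word) <= 5 and "val" in word.lower()
]
print(picked)  # ['value', 'Val']
```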