Unmatch complete words if a negative lookahead is satisfied - regex

I need to match only those words that don't contain special characters like # and :.
For example:
git#github.com shouldn't match
list should return a valid match
show should also return a valid match
I tried it using a negative lookahead, \w+(?![#:]),
but it matches gi out of git#github.com, and it shouldn't match that either.

You may add \w to the lookahead:
\w+(?![\w#:])
The equivalent is using a word boundary:
\w+\b(?![#:])
Besides, you may consider adding a left-hand boundary to avoid matching words inside non-word non-whitespace chunks of text:
^\w+(?![\w#:])
Or
(?<!\S)\w+(?![\w#:])
The ^ will match the word only at the start of the string, and (?<!\S) will match only if the word is preceded by whitespace or the start of the string.
Why not (?<!\S)\w+(?!\S), the whitespace boundaries? Since you are building a lexer, you most probably have to deal with natural language sentences where words are likely to be followed by punctuation, and the (?!\S) negative lookahead would make \w+ match only when it is followed by whitespace or sits at the end of the string.
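The variants above can be compared by running them over a few inputs. This is a minimal sketch using Python's re module (an assumption, since the question does not name a language); the labels in PATTERNS and the "list, show" input are made up for illustration.

```python
import re

# Patterns discussed above; the dictionary keys are only labels for this sketch.
PATTERNS = {
    "original": r"\w+(?![#:])",
    "right_only": r"\w+(?![\w#:])",
    "both_sides": r"(?<!\S)\w+(?![\w#:])",
    "whitespace": r"(?<!\S)\w+(?!\S)",
}

def find_words(label, text):
    """Return all matches of the labelled pattern in text."""
    return re.findall(PATTERNS[label], text)

# The original pattern backtracks and still yields a partial word.
print(find_words("original", "git#github.com"))    # ['gi', 'github', 'com']
# Adding \w to the lookahead blocks the partial match, but not "github".
print(find_words("right_only", "git#github.com"))  # ['github', 'com']
# With the left-hand boundary nothing in the chunk matches.
print(find_words("both_sides", "git#github.com"))  # []
# Trailing punctuation is tolerated by this variant ...
print(find_words("both_sides", "list, show"))      # ['list', 'show']
# ... but not by the pure whitespace boundaries.
print(find_words("whitespace", "list, show"))      # ['show']
```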

You can use negative lookbehind and negative lookahead patterns around a word pattern to make sure that the word is not preceded or followed by a non-space character, or in other words, to make sure that it is surrounded by either a space or a string boundary:
(?<!\S)\w+(?!\S)
Demo: https://regex101.com/r/cjhUUM/2

Related

Regex that doesn't recognise a pattern

I want to make a regex that recognizes some patterns and rejects others.
_*[a-zA-Z][a-zA-Z0-9_][^-]*.*(?<!_)
Samples of patterns that I want to recognize:
a100__version_2
_a100__version2
And samples of patterns that I don't want to recognize:
100__version_2
a100__version2_
_100__version_2
a100--version-2
The regex works for all of them except this one:
a100--version-2
So I don't want to match the dashes.
I tried _*[a-zA-Z][a-zA-Z0-9_][^-]*.*(?<!_)
so the problem is at [^-]
You could write the pattern like this, but [^-]* can also match newlines and spaces.
To avoid matching newlines and spaces, and to require at least 2 characters:
^_*[a-zA-Z][a-zA-Z0-9_][^-\s]*$(?<!_)
Or, to match only word characters and require at least a single character (\w* repeats zero or more times):
^_*[a-zA-Z]\w*$(?<!_)
^ Start of string
_* Match optional underscores
[a-zA-Z] Match a single char a-zA-Z
\w* Match optional word chars (Or [a-zA-Z0-9_]*)
$ End of string
(?<!_) Assert not _ to the left at the end of the string
Regex demo
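To check the shorter pattern against the samples from the question, something like the following could be used. It is a sketch in Python's re module (an assumption, the question does not name a regex flavor), and NAME_RE and is_valid are hypothetical names.

```python
import re

# Second pattern from the answer above: optional leading underscores, a letter,
# then word characters only, and the string must not end in "_".
NAME_RE = re.compile(r"^_*[a-zA-Z]\w*$(?<!_)")

def is_valid(name):
    """True if the whole string is an accepted name."""
    return NAME_RE.search(name) is not None

accepted = ["a100__version_2", "_a100__version2"]
rejected = ["100__version_2", "a100__version2_", "_100__version_2", "a100--version-2"]

for name in accepted + rejected:
    print(f"{name!r}: {is_valid(name)}")
```

In Python 3, \w also matches non-ASCII letters and digits unless re.ASCII is passed, so the explicit class [a-zA-Z0-9_]* is the stricter choice there.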

Regex boundary to also exclude special characters

I would like to make a regex to match a word, but don't match it if there are special characters on its sides.
I tried to use a word boundary (\b) on both sides but it doesn't seem to exclude special characters...
For example, this should work:
text word-to-match more-text
But this should not:
text word-to-match-more-text
Because there is a - between the word to match and more text.
What i have now is this:
(?<=[^-\[\]{}()+?.,\\^$|#])\bword-to-match\b(?=[^-\[\]{}()+?.,\\^$|#])
I would like to know if there is a more elegant way than using [^-\[\]{}()+?.,\\^$|#] on both sides of the word.
Thanks in advance!
You may use lookahead and lookbehind on both sides to fail the match if there is a non-whitespace character on either side:
(?<!\S)word-to-match(?!\S)
(?<!\S): Fail if previous character is a non-whitespace
(?!\S): Fail if next character is a non-whitespace
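As a rough illustration in Python (an assumption, the question does not name a language), the whitespace lookarounds can be wrapped around any literal word. The helper names below are hypothetical; re.escape is used so that characters such as - or . in the word are treated literally.

```python
import re

def make_word_re(word):
    """Build a pattern that matches word only when whitespace or a string
    boundary sits on both sides of it."""
    return re.compile(r"(?<!\S)" + re.escape(word) + r"(?!\S)")

WORD_RE = make_word_re("word-to-match")

def contains_word(text):
    return WORD_RE.search(text) is not None

print(contains_word("text word-to-match more-text"))  # True
print(contains_word("text word-to-match-more-text"))  # False
print(contains_word("word-to-match"))                 # True
```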

Regex for spoof

I would like to ask for help with catching spoofed names, let's say usernames, using regex.
for example the correct username is :
rolf
and here are the spoofed versions that I could think of:
roooolf
r123olf
123rolf123
rolf5623
123rolf
rollllf
rrrrrrolf
rolffff
So basically I have this regex expression ( that I know is not sufficient because I've tried it on regex101 website )
.+(?![rolf]).+
I'm using this as a baseline because it doesn't catch the correct username, which is:
rolf
but it also doesn't catch all the other "spoofed" versions of the username.
Any ideas how I can make my regex more efficient?
Thanks in advance!
You may try this too
(?m)^(?![^\n]*?rolf[^\n]*$).*$
To match anything that is not exactly rolf, you can use a negative lookahead (?!...) to assert that what follows from the beginning of the string is not 'rolf' up to the end of the string.
^(?!rolf$).+$
Explanation:
^ Assert position at the beginning of the string
(?! Negative lookahead that asserts that what follows is not
rolf$ rolf literally, followed by the end of the string
) Close negative lookahead
.+ Match any character one or more times
$ Assert position at the end of the string
Your example regex uses .+ which, as @Ωmega fairly points out, also matches spaces.
Instead of .+ you could specify what characters you might accept like \w+ for example to match one or more word characters or specify more using a character class.
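A small sketch of this answer in Python (an assumption, the question does not name a language), run against the usernames listed in the question. The \w+ variant is the stricter alternative suggested above; the variable names and the "ro lf" input are made up for illustration.

```python
import re

# Anything that is not exactly "rolf" (any characters allowed).
NOT_EXACT_ANY = re.compile(r"^(?!rolf$).+$")
# Same idea, restricted to word characters as suggested above.
NOT_EXACT_WORD = re.compile(r"^(?!rolf$)\w+$")

usernames = ["rolf", "roooolf", "r123olf", "123rolf123", "rolf5623",
             "123rolf", "rollllf", "rrrrrrolf", "rolffff"]

# Every name except the genuine one is flagged.
flagged = [u for u in usernames if NOT_EXACT_ANY.search(u)]
print(flagged)

# A name containing a space passes .+ but not \w+.
print(bool(NOT_EXACT_ANY.search("ro lf")))   # True
print(bool(NOT_EXACT_WORD.search("ro lf")))  # False
```

Note that this only separates the exact name from everything else; it does not measure how similar a name is to rolf.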
You can use a regex pattern
\b(?!rolf\b)\S+\b
\b Word boundary - Matches a word boundary position between a word character and non-word character or position (start / end of string).
(?! Negative lookahead - Specifies a group that can not match after the main expression (if it matches, the result is discarded).
\S Not whitespace - Matches any character that is not a whitespace character (spaces, tabs, line breaks).
+ Quantifier - Match 1 or more of the preceding token.
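Unlike an anchored pattern, this one can scan free text for tokens. A minimal Python sketch (the language, the sample text and the function name are assumptions made for illustration):

```python
import re

# Word-boundary variant from the answer above.
SPOOF_RE = re.compile(r"\b(?!rolf\b)\S+\b")

def find_spoofs(text):
    """Return every whitespace-delimited token that is not exactly 'rolf'."""
    return SPOOF_RE.findall(text)

print(find_spoofs("rolf rolffff 123rolf"))  # ['rolffff', '123rolf']
print(find_spoofs("rolf"))                  # []
```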

Remove non-alphabetic words from a sentence using regex

Is it possible to remove words in a sentence that don't contain a-z letters? I've thought about negative lookarounds but wasn't successful.
For example,
This is a 1-2-a3 sample 12 -- 7-8 sentence
becomes
This is a 1-2-a3 sample sentence
Assume all other punctuation was removed except dashes.
Thanks!
The regex below matches those words that contain no letters.
(?<!\S)[^a-zA-Z\s]+(?!\S)
Just replace the matched words with an empty string to get your desired output. (?<!\S) is a negative lookbehind which asserts that the match is not preceded by a non-space character. (?!\S) is a negative lookahead which asserts that the match is not followed by a non-space character.
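In Python (an assumption, the question does not name a language) the replacement could look like the sketch below. Removing the words leaves their surrounding spaces behind, so the helper also collapses runs of whitespace; that extra step and the function name are not part of the answer above.

```python
import re

NON_ALPHA_WORD = re.compile(r"(?<!\S)[^a-zA-Z\s]+(?!\S)")

def drop_non_alpha_words(sentence):
    """Remove whitespace-delimited words that contain no a-z/A-Z letter."""
    stripped = NON_ALPHA_WORD.sub("", sentence)
    # Extra step (an addition in this sketch): collapse leftover whitespace.
    return " ".join(stripped.split())

print(drop_non_alpha_words("This is a 1-2-a3 sample 12 -- 7-8 sentence"))
# This is a 1-2-a3 sample sentence
```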

Find slash that are NOT followed by non word character

I am trying to write a regex for finding slashes only that are not followed by special characters.
For example, if the string is,
/PErs/#loc/g/2, then the regex should find the slashes (/) that are before P, g and 2. It should not return the slash before #, as # is a special character.
I could write \/\w but it returns /P, /g and /2.
The simplest option is to use a word boundary, \b:
\/\b
\b matches between a word character and a non-word character.
You want to use the lookahead operator.
Positive lookahead or detect if something is present after (ahead)
Try this regex instead:
\/(?=\w)
Here we use the positive lookahead operator (?=). It checks that the given expression is present at that position but does not include it in the match.
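The difference between consuming the next character and only looking at it can be seen in a short Python sketch (the language is an assumption; the sample string is the one from the question, and the variable names are made up):

```python
import re

text = "/PErs/#loc/g/2"

# \w is part of the match, so the following character is returned too.
with_char = re.findall(r"/\w", text)
print(with_char)      # ['/P', '/g', '/2']

# The lookahead only checks the next character; the match is the slash alone.
slashes_only = re.findall(r"/(?=\w)", text)
print(slashes_only)   # ['/', '/', '/']

# Positions of the matched slashes in the string.
positions = [m.start() for m in re.finditer(r"/(?=\w)", text)]
print(positions)      # [0, 10, 12]
```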
Negative lookahead or detect if something is NOT present after (ahead)
Alternatively, you can also use the negative look ahead operator (?!).
\/(?![#])
Negative lookahead with multiple special characters
This will match any / NOT followed by #. If you have more special characters, simply add them to the character class.
For example, if # and % were special characters, the regular expression above would become:
\/(?![#%])
Matching slashes NOT followed by a NON-word character is not the same as matching slashes followed by a word character.
Have a try with:
/(?!\W)
This matches slashes NOT followed by a NON-word character.
Unlike a lookahead for a word character, it also matches the final slash in a string such as: PErs/
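The four patterns from the answers agree on the sample string and differ only when a slash ends the string. A small Python sketch (the language and the helper name are assumptions) makes that visible:

```python
import re

def slash_positions(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

sample = "/PErs/#loc/g/2"
trailing = "PErs/"

for pattern in (r"/\b", r"/(?=\w)", r"/(?![#])", r"/(?!\W)"):
    print(pattern, slash_positions(pattern, sample), slash_positions(pattern, trailing))

# On the sample, all four patterns find the slashes at 0, 10 and 12.
# On "PErs/", /\b and /(?=\w) find nothing, because they need a word
# character after the slash, while the two negative lookaheads match
# the slash at index 4: nothing follows it, so nothing forbidden follows it.
```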