Regex negation - word parsing - regex

I am trying to parse a phrase and exclude common words.
For instance in the phrase "as the world turns", I want to exclude the common words "as" and "the" and return only "world" and "turns".
(\w+(?!the|as))
Doesn't work. Feedback appreciated.

The lookahead should come first:
(\b(?!(the|as)\b)\w+\b)
I have also added word boundaries so that only whole words are tested. Without them, the pattern would reject the complete word "as" but would still match the letter "s" of that word.
You might also want to consider what \w matches and if that meets your needs. If you are looking for words in English you probably are interested in letters but not digits and you may wish to include some punctuation characters that are excluded by \w, such as apostrophes. You could try something like this instead (Rubular):
/(\b(?!(?:the|as)\b)[a-z'-]+\b)/i
To match words more accurately in a human language you could consider using a natural language parsing library instead of regular expressions.
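As a minimal sketch of the lookahead-first pattern above, here is how it might be used from Python. The language choice, the STOP_WORDS tuple, and the content_words helper are assumptions made for illustration; they are not part of the original question.

```python
import re

# Hypothetical stop-word list; extend as needed.
STOP_WORDS = ("the", "as")

# The lookahead is tested at the start of each word, before any letters are consumed.
WORD_RE = re.compile(
    r"\b(?!(?:%s)\b)[a-z'-]+\b" % "|".join(re.escape(w) for w in STOP_WORDS),
    re.IGNORECASE,
)

def content_words(phrase):
    """Return the words in phrase that are not stop words."""
    return WORD_RE.findall(phrase)

print(content_words("as the world turns"))  # ['world', 'turns']
```

Because the stop word inside the lookahead is followed by \b, a longer word such as "ash" is still kept.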

You should use word boundaries to only match whole words. Either with a look-ahead assertion:
(\b(?!(?:the|as)\b)\w+\b)
Or with a look-behind assertion:
(\b\w+\b(?<!\b(?:the|as)))
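Not every engine accepts that look-behind as written. Python's re module, for example, requires look-behinds to have a fixed width, and "the" and "as" differ in length. A possible workaround, assuming Python, is one look-behind per excluded word:

```python
import re

# One fixed-width look-behind per stop word, checked after the word is matched.
LOOKBEHIND_RE = re.compile(r"\b\w+\b(?<!\bthe)(?<!\bas)")

print(LOOKBEHIND_RE.findall("as the world turns"))  # ['world', 'turns']
```

The \b inside each look-behind keeps words that merely end in "as" or "the", such as "atlas", from being rejected.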

Related

Regex not extracting all matching words

I am trying to extract words that have at least one character from a special character set. It picks up some words and not others. Here is a link to regex101 to test it. This is the regex \b(\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*)\b, and this is the sample sentence I am using
His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn
Al-Daḥāk Al-Sulamī Al-Tirmidhī.
It should match the following words:
ʿĪsa Muḥammad ʿĪsa Mūsa Al-Daḥāk Al-Sulamī Al-Tirmidhī
I am not too experienced with regex, so I have no idea what I am doing wrong. If someone knows any tool to find out why a specific word doesn't match a regex pattern, please let me know as well.
You can use
[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ][\wāīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]*
After matching the one required special character, use another character set to match more occurrences of those characters or normal word characters.
https://regex101.com/r/ovJoLt/2
You can make this work by enabling the Unicode flag /u (so that the word boundary \b assertions support Unicode characters) and adding hyphens to the surrounding character groups:
/\b[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+[\w-]*\b/gu
Plus, you don't need the capturing group, since the only characters being matched form the desired output anyway (\b is a zero-width assertion).
You are not doing anything wrong, except that to match Unicode word boundaries you have to enable the u modifier or use (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*(?!\S)
If you want to match hyphens as well, add one to your character class: (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]+\w*(?!\S)
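For comparison, here is a rough Python sketch of the same idea. In Python 3, \w and \b are Unicode-aware by default for str patterns, so no extra flag is needed. The NFC normalisation step is an added assumption, there to make sure each accented letter is a single code point in both the pattern and the text.

```python
import re
import unicodedata

# Characters that make a word "special", normalised so each is one code point.
SPECIAL = unicodedata.normalize("NFC", "āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ")

# Optional word characters or hyphens, at least one special character, then more of either.
SPECIAL_WORD_RE = re.compile(r"\b[\w-]*[%s]+[\w-]*\b" % SPECIAL)

SENTENCE = unicodedata.normalize(
    "NFC",
    "His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn "
    "Al-Daḥāk Al-Sulamī Al-Tirmidhī.",
)

matches = SPECIAL_WORD_RE.findall(SENTENCE)
print(matches)
```

With the sample sentence this should find the seven names listed in the question, including the hyphenated ones.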

Regex to match other than listed string

I need to select a value that is not in the following list and that contains no special characters.
Strings and characters that need to be rejected:
XNIL
SNIL
All special characters
My expression is like this (?!XNIL|SNIL|[\W])\w+
The problem is that if my text contains the word XNIL or SNIL, the regex still matches the NIL part. But I have listed XNIL and SNIL as words to be rejected. What mistake did I make here?
You can check my regex online here -> http://regexr.com/3cdsl
This seems to work on your test page: (?!(XNIL|SNIL|\W+))\b\w+ At least it solves the XNIL/SNIL problem.
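Your regex was still matching part of XNIL because of the \w+: the lookahead fails at the X, so the engine moves on one character and matches NIL from there. The \b stops a match from starting in the middle of a word. To see why, take your original and change \w+ to \w and notice the difference.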
UPDATE:
Based on your feedback, you also wish to exclude _.
Because _ is used in programming language symbols, and [arguably] regexes were created, of, by, and for programmers, _ is considered a "word" char (i.e. it's in \w and therefore not excluded by \W).
From the [perl] regex man page:
\w Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks)
Your final regex might need to be: (?!(XNIL|SNIL|_+|\W+))\b\w+. (Note: the _+)
A cleaner way: (?!(XNIL|SNIL|[\W_]+))\b\w+ which produces the same results yet is closer in intent to what you wanted.
You may have to adjust \w+ accordingly as well.
If you really want to be sure, at the expense of being slightly more verbose, write out the character class as you choose:
(?!(XNIL|SNIL|[^a-zA-Z0-9]+))\b[a-zA-Z0-9]+
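The difference between the original pattern and the \b version can be seen in a small Python sketch. The sample text is made up for illustration, and the group is written as non-capturing so that findall returns whole matches.

```python
import re

text = "XNIL SNIL NIL ABC _tmp"

# Original: the lookahead fails at X or S, but succeeds one character later at N.
original = re.findall(r"(?!XNIL|SNIL|[\W])\w+", text)

# With \b the match must start at a word boundary; [\W_] also rejects underscores.
fixed = re.findall(r"(?!(?:XNIL|SNIL|[\W_]+))\b\w+", text)

print(original)  # ['NIL', 'NIL', 'NIL', 'ABC', '_tmp']
print(fixed)     # ['NIL', 'ABC']
```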
Check this regex
[^(XNIL|SNIL|[^\w])]
Explanation
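[] with ^ at the beginning says that anything not in the list given in [] should be matched.
(XNIL|SNIL|[^\w]) is meant to match the words XNIL or SNIL, and [^\w] matches anything other than word characters (i.e. special characters).
So the whole regex is meant to match anything that is not listed in [^(XNIL|SNIL|[^\w])]. Note, however, that inside a character class (, | and ) are literal characters, so the class excludes individual characters rather than the whole words XNIL and SNIL.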
This should work
(?m)^(((?!XNIL|SNIL|[\W]).)*)$
Grouping the character match with the negative lookahead will cause the zero length assertion to continue until finished (in this case at the end of the string due to $)
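A short Python sketch of this whole-line approach, using made-up sample lines: a line is accepted only if no position on it starts XNIL, SNIL, or a non-word character.

```python
import re

lines = "ABC\nXNIL\nfooSNILbar\nNIL\na b"

# Every character on the line must pass the negative lookahead before it is consumed.
LINE_RE = re.compile(r"(?m)^(((?!XNIL|SNIL|[\W]).)*)$")

accepted = [m.group(0) for m in LINE_RE.finditer(lines)]
print(accepted)  # ['ABC', 'NIL']
```

Note that this rejects a line containing XNIL or SNIL anywhere, not only as a whole word.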

RegEx lookahead but not immediately following

I am trying to match terms such as the Dutch ge-berg-te. berg is a noun by itself, and ge...te is a circumfix, i.e. geberg does not exist, nor does bergte. gebergte does. What I want is a RegEx that matches berg or gebergte, working with a lookaround. I was thinking this would work
\b(?i)(ge(?=te))?berg(te)?\b
But it doesn't. I am guessing because a lookahead only checks the immediately following characters, and not across characters. Is there any way to match characters with a lookahead without the constraint that those characters have to be immediately behind the others?
Valid matches would be:
Berg
berg
Gebergte
gebergte
Invalid matches could be:
Geberg
geberg
Bergte
bergte
ge-/Ge- and -te always have to occur together. Note that I want to try this with a lookahead. I know it can be done more simply, but I want to see if it's methodologically possible to do something like this.
Here is one non-lookaround based regex:
\b(berg|gebergte)\b
Use it with the i (ignore case) flag. This regex uses alternation and word boundaries to search for the complete words berg or gebergte.
Lookaround based regex:
(?<=\bge)berg(?=te\b)|\bberg\b
This regex uses a lookahead and a lookbehind to search for berg preceded by ge and followed by te. Alternatively, it matches the complete word berg using the word boundary assertion \b, which is zero-width like the anchors ^ and $.
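The lookaround pattern can be checked against the word lists from the question with a small Python sketch. A lookahead always starts testing at the current position, but the pattern inside it can span as many characters as needed. FULL_WORD_RE below is an illustrative variant that uses this (it is not from the answer): the lookahead after ge spells out everything up to te, so the whole word is matched.

```python
import re

# Pattern from the answer: matches only the "berg" part, with ge...te asserted around it.
LOOKAROUND_RE = re.compile(r"(?<=\bge)berg(?=te\b)|\bberg\b", re.IGNORECASE)

# Illustrative variant: the lookahead after "ge" covers everything up to "te".
FULL_WORD_RE = re.compile(r"\b(?:ge(?=bergte\b)bergte|berg)\b", re.IGNORECASE)

valid = ["Berg", "berg", "Gebergte", "gebergte"]
invalid = ["Geberg", "geberg", "Bergte", "bergte"]

for word in valid + invalid:
    print(word, bool(LOOKAROUND_RE.search(word)), bool(FULL_WORD_RE.search(word)))
```

For Gebergte, the answer's pattern returns only berg as the match, while the variant returns the whole word.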
To forbid a string in general, put a negative lookahead at the beginning of the pattern and let it scan past any number of other characters before the string you want to forbid:
regex: don't match if containing a specific string
^(?!.*720).*
This will not match if the string contains 720, but will match everything else.
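A quick Python check of the "does not contain" idiom, with made-up file names as input:

```python
import re

# The lookahead scans the whole string from the start; any "720" makes it fail.
NO_720_RE = re.compile(r"^(?!.*720).*")

print(bool(NO_720_RE.match("clip_1080p.mp4")))  # True
print(bool(NO_720_RE.match("clip_720p.mp4")))   # False
```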

Regular Expression - match whole string or pattern

I'm doing a file search, and I want to exclude files that contain min in them, but I don't want to match parts of words.
so I want to exclude:
a directory named min
also a file named jhkjahdf-min.txt
But I don't want to exclude:
a file named mint.txt
Thank you for your help in advance. Explanations of the regular expression you give would be a lot of extra help.
I'm using PHP's preg_match() function
There are things called word boundaries (\b). They do not consume a character, but match where you go from a word character (letters, digits, underscores) to a non-word character or vice versa. So this is a good first approximation:
`\bmin\b`
Now if you want to consider digits and underscores as word boundaries, too, it gets a bit more complicated. Then you need negative lookaheads and lookbehinds, which are not supported by all regex engines:
`(?<![a-zA-Z])min(?![a-zA-Z])`
These will also not be included in the match (if you care about it at all), but just assert that min is neither preceded nor followed by a letter.
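The two patterns behave the same on the names from the question and differ only when min touches a digit or underscore. The sketch below uses Python rather than PHP's preg_match, purely for illustration; file_min2.txt is a made-up name added to show the difference.

```python
import re

WORD_BOUNDARY_RE = re.compile(r"\bmin\b")
LETTER_GUARD_RE = re.compile(r"(?<![a-zA-Z])min(?![a-zA-Z])")

# The last name is a made-up example where the two patterns disagree.
names = ["min", "jhkjahdf-min.txt", "mint.txt", "file_min2.txt"]

for name in names:
    print(name, bool(WORD_BOUNDARY_RE.search(name)), bool(LETTER_GUARD_RE.search(name)))
```

Both patterns match min and jhkjahdf-min.txt and skip mint.txt. Only the lookaround version matches file_min2.txt, because _ and 2 count as word characters for \b.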
You'll want to anchor your regex with \b which means "word boundary."

Perl regular expression for English word

I need a regular expression that will find anything that looks like an English word. In particular, I want the expression to match when a string has:
1) only letters; and
2) at least two different letters. (I am purposely excluding one-letter words.)
So I'm looking for something that would match the and abracadabra but not aaa.
Any help is much appreciated.
Perhaps \b(\w*(\w)\w*(?!\2)\w+)\b works for you. It handles the examples you give.
It matches a letter \w in a group, then looks for something other than that letter using a backreference and a negative lookahead (?!\2). We match at least one character at the end, which is necessary to make the negative lookahead force at least one distinct character. Then we place additional \w*'s around to allow additional letters. \b ensures the ends of the match are at word boundaries.
http://www.rubular.com/r/pwjGi9eLf5
Please note that this is no super duper regular expression that matches English-only words. For that, you want to compare against a dictionary. But that doesn't seem to be what you're looking to do here.
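The suggested pattern can be tried in Python as a rough check (the question is about Perl, but the syntax used here is the same in both). One caveat worth knowing: \w also accepts digits and underscores, so a token such as a1 passes even though the question asks for letters only.

```python
import re

WORDLIKE_RE = re.compile(r"\b(\w*(\w)\w*(?!\2)\w+)\b")

for candidate in ["the", "abracadabra", "aaa", "a", "a1"]:
    print(candidate, bool(WORDLIKE_RE.search(candidate)))
```

Replacing each \w with [A-Za-z] would be one way to restrict the match to letters.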
Check out Lingua::EN::Splitter:
use strict; use warnings;
use Lingua::EN::Splitter qw(words);
my @words = words $input_text;
print @words;