Perl Regex "Not" (negative lookahead) - regex

I'm not terribly certain what the correct wording for this type of regex would be, but basically what I'm trying to do is match any string that starts with "/" but is not followed by "bob/", as an example.
So these would match:
/tom/
/tim/
/steve
But these would not:
tom
tim
/bob/
I'm sure the answer is terribly simple, but I had a difficult time searching for "regex not" anywhere. I'm sure there is a fancier word for what I want that would pull good results, but I'm not sure what it would be.
Edit: I've changed the title to indicate the correct name for what I was looking for

You can use a negative lookahead (documented under "Extended Patterns" in perlre):
/^\/(?!bob\/)/

TLDR: Negative Lookaheads
If you wanted a negative lookahead just to find "foo" when it isn't followed by "bar"...
$string =~ m/foo(?!bar)/g;
To quote the docs...
(?!pattern)
(*nla:pattern)
(*negative_lookahead:pattern)
A zero-width negative lookahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note however that lookahead and lookbehind are NOT the same thing. You cannot use this for lookbehind. (Source: PerlDocs.)
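A minimal sketch of the quoted behaviour, written in Python rather than Perl (an assumption made for illustration; Python's re module accepts the same (?!pattern) syntax). The sample string is made up.

```python
import re

# Hypothetical sample text: "foo" appears three times.
text = "foobar foobaz foo"

# foo(?!bar) matches "foo" only where "bar" does not come next.
# The lookahead is zero-width, so each match is just "foo".
starts = [m.start() for m in re.finditer(r"foo(?!bar)", text)]
print(starts)  # [7, 14]
```

The first "foo" (index 0) is skipped because "bar" follows it; the other two are kept.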
Negative Lookaheads For Your Case
The accepted answer is great, but it leaves no explanation, so let me add one...
/^\/(?!bob\/)/
^ - Match only at the start of the string.
\/ - Match the / char, which we need to escape because / is the delimiter in the regex syntax (i.e. s/find/replacewith/, etc.).
(?!...) - Do not match if the text matched so far is followed by ....
bob\/ - This is the ... value: don't match bob/. Once more, we need to escape the /.
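The breakdown above can be sketched in Python with a verbose pattern, run against the strings from the question. Python is an assumption here (the question is about Perl), and since Python has no regex delimiters the slash needs no escaping.

```python
import re

# Same pattern as /^\/(?!bob\/)/, written in verbose mode.
pattern = re.compile(r"""
    ^          # start of the string
    /          # a literal slash
    (?!bob/)   # fail if bob/ comes next; consumes nothing
""", re.VERBOSE)

samples = ["/tom/", "/tim/", "/steve", "tom", "tim", "/bob/"]
matched = [s for s in samples if pattern.search(s)]
print(matched)  # ['/tom/', '/tim/', '/steve']
```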

Related

Regex Negative Lookbehind Matches Lookbehind text .NET

Say I have the following strings:
PB-GD2185-11652-MTCH
GD2185-11652-MTCH
KD-GD2185-11652-MTCH
KD-GD2185-11652
I want REGEX.IsMatch to return true if the string has MTCH in it and does not start with PB.
I expected the regex to be the following:
^(?<!PB)\S+(?=MTCH)
but that gives me the following matches:
PB-GD2185-11652-
GD2185-11652-
KD-GD2185-11652-
I do not understand why the negative lookbehind not only doesn't exclude the match but includes the PB characters in the match. The positive lookahead works as expected.
EDIT 1
Let me start with a simpler example. The following regex matches all of the strings as I would expect it to:
\S+
The following regex still matches all of the strings even though I would expect it not to:
\S+(?!MTCH)
The following regex matches all but the final H character on the first three strings:
\S+(?<!MTCH)
From the documentation at regex 101, a lookahead looks for text to the right of the pattern and a lookbehind looks for text to the left of the pattern, so having a lookahead at the beginning of a string does not jibe with the documentation.
Edit 2
Take another example with the following three strings:
grey
greyhound
hound
The regex:
^(?<!grey)hound
only matches the final hound, whereas the regex:
^(?<!grey)\S+
matches all three.
You need a lookahead: ^(?!PB)\S+(?=MTCH). Using the look-behind means the PB has to come before the first character.
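A quick way to see the difference is to run both patterns over the four strings from the question. This sketch uses Python's re module instead of .NET's Regex.IsMatch (an assumption for illustration; both engines treat these lookarounds the same way).

```python
import re

samples = [
    "PB-GD2185-11652-MTCH",
    "GD2185-11652-MTCH",
    "KD-GD2185-11652-MTCH",
    "KD-GD2185-11652",
]

# Lookbehind at position 0: nothing precedes the start, so "not preceded
# by PB" is always true and the assertion filters nothing.
lookbehind = re.compile(r"^(?<!PB)\S+(?=MTCH)")

# Lookahead at position 0: inspects the first two characters of the string.
lookahead = re.compile(r"^(?!PB)\S+(?=MTCH)")

behind_hits = [s for s in samples if lookbehind.search(s)]
ahead_hits = [s for s in samples if lookahead.search(s)]

print(behind_hits)  # the first three strings, including the PB one
print(ahead_hits)   # ['GD2185-11652-MTCH', 'KD-GD2185-11652-MTCH']
```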
The problem was caused by the greediness of \S+. When dealing with lookarounds and greedy quantifiers you can easily match more characters than you expect. One way to deal with this is to put a negative lookaround in a group with the greedy quantifier so that the unwanted text is excluded from the match, as stated in this question:
How to non-greedy multiple lookbehind matches
and on this helpful website about greediness in regular expressions:
http://www.rexegg.com/regex-quantifiers.html
Note that this second link has a few other ways to deal with the greediness in various situations.
A good regular expression for this situation is as follows:
^(?<!PB)((?!PB)\S+)(MTCH)
In situations like this it is going to be much clearer to do it logically within the code. So first check that the string contains MTCH and then that it doesn't match ^PB.
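A sketch of that two-step check, in Python for illustration. The function name is_wanted is hypothetical, and plain string methods stand in for the two regex tests.

```python
def is_wanted(s: str) -> bool:
    """True if s contains MTCH and does not start with PB."""
    return "MTCH" in s and not s.startswith("PB")


samples = [
    "PB-GD2185-11652-MTCH",
    "GD2185-11652-MTCH",
    "KD-GD2185-11652-MTCH",
    "KD-GD2185-11652",
]
results = [is_wanted(s) for s in samples]
print(results)  # [False, True, True, False]
```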

skipping comments with regex

this has been asked so many times - yet I don't get why the following negative look-behind still matches after the comment character ";" ?!
(?<!;).+mylib.*
Debuggex Demo
TEST-TEXT:
; /home/mylib/blabla/laydef1.rul (matches wrongly!?)
/home/mylib/blabla/laydef2.rul (matches as it should)
P.S. RegEx class is PCRE
Since PCRE doesn't support variable length lookbehind you can use this regex construct:
/^\h*(?:;.*(*SKIP)(*F)|.*mylib.*)/m
RegEx Demo
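Your regex (?<!;).+mylib.* fails because the lookbehind only checks the single character before the position where the match starts. At the start of the line nothing precedes that position, so the assertion passes and .+ then matches everything from ; to mylib.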
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice workaround for the restriction that you cannot have a variable-length lookbehind in the above regex.
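For engines without backtracking verbs, a different technique gives the same result on the test text. Python's built-in re module does not support (*SKIP)(*F), so the sketch below swaps in a negative lookahead anchored at the start of each line. It is not the PCRE pattern above, only an alternative under that assumption.

```python
import re

# The two test lines from the question, without the annotations.
text = (
    "; /home/mylib/blabla/laydef1.rul\n"
    "/home/mylib/blabla/laydef2.rul\n"
)

# ^ with re.MULTILINE anchors at every line start; the lookahead rejects
# lines whose first non-blank character is the comment marker ";".
pattern = re.compile(r"^(?![ \t]*;).*mylib.*", re.MULTILINE)

hits = pattern.findall(text)
print(hits)  # ['/home/mylib/blabla/laydef2.rul']
```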

RegEx lookahead but not immediately following

I am trying to match terms such as the Dutch ge-berg-te. berg is a noun by itself, and ge...te is a circumfix, i.e. geberg does not exist, nor does bergte. gebergte does. What I want is a RegEx that matches berg or gebergte, working with a lookaround. I was thinking this would work
\b(?i)(ge(?=te))?berg(te)?\b
But it doesn't. I am guessing because a lookahead only checks the immediately following characters, and not across characters. Is there any way to match characters with a lookahead without the constraint that those characters have to be immediately behind the others?
Valid matches would be:
Berg
berg
Gebergte
gebergte
Invalid matches could be:
Geberg
geberg
Bergte
bergte
ge-/Ge- and -te always have to occur together. Note that I want to try this with a lookahead. I know it can be done more simply, but I want to see if it's methodologically possible to do something like this.
Here is one non-lookaround based regex:
\b(berg|gebergte)\b
Use it with i (ignore case) flag. This regex uses alternation and word boundary to search for complete words berg OR gebergte.
RegEx Demo
Lookaround based regex:
(?<=\bge)berg(?=te\b)|\bberg\b
This regex uses a lookbehind and a lookahead to search for berg preceded by ge and followed by te. Alternatively, it matches the complete word berg using the word boundary assertion \b, which is also a zero-width assertion like the anchors ^ and $.
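Both patterns can be checked against the word lists from the question. This is a sketch in Python (an assumption; the question names no engine), with re.IGNORECASE standing in for the i flag.

```python
import re

words = ["Berg", "berg", "Gebergte", "gebergte",
         "Geberg", "geberg", "Bergte", "bergte"]

whole_word = re.compile(r"\b(berg|gebergte)\b", re.IGNORECASE)
lookaround = re.compile(r"(?<=\bge)berg(?=te\b)|\bberg\b", re.IGNORECASE)

accepted_whole = [w for w in words if whole_word.search(w)]
accepted_look = [w for w in words if lookaround.search(w)]
print(accepted_whole)  # ['Berg', 'berg', 'Gebergte', 'gebergte']
print(accepted_look)   # ['Berg', 'berg', 'Gebergte', 'gebergte']

# The lookarounds consume nothing, so only the stem is returned.
stem = lookaround.search("gebergte").group(0)
print(stem)  # berg
```

The two patterns accept the same words, but the matched text differs: the alternation returns the whole word, the lookaround version returns only berg.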
To forbid a string anywhere in the input, put a negative lookahead at the beginning of the pattern and let it skip any number of other characters before the string you want to forbid:
regex: don't match if containing a specific string
^(?!.*720).*
This will not match if the string contains 720, but will match everything else.
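A small check of the "does not contain 720" pattern, sketched in Python with made-up sample strings:

```python
import re

# The lookahead scans the whole string from position 0 for "720"
# and vetoes the match if it finds it anywhere.
no_720 = re.compile(r"^(?!.*720).*")

samples = ["video_1080.mp4", "video_720.mp4", "720p"]  # hypothetical inputs
allowed = [s for s in samples if no_720.match(s)]
print(allowed)  # ['video_1080.mp4']
```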

How does the regular expression ‘(?<=#)[^#]+(?=#)’ work?

I have the following regex in a C# program, and have difficulties understanding it:
(?<=#)[^#]+(?=#)
I'll break it down to what I think I understood:
(?<=#) a group, matching a hash. what's `?<=`?
[^#]+ one or more non-hashes (used to achieve non-greediness)
(?=#) another group, matching a hash. what's the `?=`?
So the problem I have is the ?<= and ?< part. From reading MSDN, ?<name> is used for naming groups, but in this case the angle bracket is never closed.
I couldn't find ?= in the docs, and searching for it is really difficult, because search engines will mostly ignore those special chars.
They are called lookarounds; they allow you to assert if a pattern matches or not, without actually making the match. There are 4 basic lookarounds:
Positive lookarounds: see if we CAN match the pattern...
(?=pattern) - ... to the right of current position (look ahead)
(?<=pattern) - ... to the left of current position (look behind)
Negative lookarounds - see if we can NOT match the pattern
(?!pattern) - ... to the right
(?<!pattern) - ... to the left
As an easy reminder, for a lookaround:
= is positive, ! is negative
< is look behind, otherwise it's look ahead
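The four forms can be tried side by side. This is a sketch in Python (the question is about C#, but the syntax is the same in both); the sample text is made up.

```python
import re

text = "price: $100 and 200 EUR"  # hypothetical sample

# Positive lookahead: digits that are followed by " EUR".
ahead_pos = re.findall(r"\d+(?= EUR)", text)
# Positive lookbehind: digits that are preceded by "$".
behind_pos = re.findall(r"(?<=\$)\d+", text)
# Negative lookahead: whole numbers NOT followed by " EUR".
ahead_neg = re.findall(r"\b\d+\b(?! EUR)", text)
# Negative lookbehind: whole numbers NOT preceded by "$".
behind_neg = re.findall(r"(?<!\$)\b\d+", text)

print(ahead_pos, behind_pos)  # ['200'] ['100']
print(ahead_neg, behind_neg)  # ['100'] ['200']
```

The \b in the negative cases keeps the engine from backtracking into part of a number (for example matching 20 out of 200) just to satisfy the assertion.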
References
regular-expressions.info/Lookarounds
But why use lookarounds?
One might argue that lookarounds in the pattern above aren't necessary, and #([^#]+)# will do the job just fine (extracting the string captured by \1 to get the non-#).
Not quite. The difference is that since a lookaround doesn't match the #, it can be "used" again by the next attempt to find a match. Simplistically speaking, lookarounds allow "matches" to overlap.
Consider the following input string:
and #one# and #two# and #three#four#
Now, #([a-z]+)# will give the following matches (as seen on rubular.com):
and #one# and #two# and #three#four#
    \___/     \___/     \_____/
Compare this with (?<=#)[a-z]+(?=#), which matches:
and #one# and #two# and #three#four#
     \_/       \_/       \___/ \__/
Unfortunately this can't be demonstrated on rubular.com, since it doesn't support lookbehind. However, it does support lookahead, so we can do something similar with #([a-z]+)(?=#), which matches (as seen on rubular.com):
and #one# and #two# and #three#four#
    \__/      \__/      \____/\___/
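The same comparison can be reproduced in Python, whose re module supports both lookahead and fixed-width lookbehind (Python is an assumption here; the question is about C#).

```python
import re

s = "and #one# and #two# and #three#four#"

# Consuming both hashes: the "#" after "three" is used up, so "four" is lost.
consuming = re.findall(r"#([a-z]+)#", s)
# Lookarounds leave the hashes unconsumed, so "four" can reuse the same "#".
lookaround = re.findall(r"(?<=#)[a-z]+(?=#)", s)
# Consuming only the leading "#" and asserting the trailing one also works.
mixed = re.findall(r"#([a-z]+)(?=#)", s)

print(consuming)   # ['one', 'two', 'three']
print(lookaround)  # ['one', 'two', 'three', 'four']
print(mixed)       # ['one', 'two', 'three', 'four']
```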
References
regular-expressions.info/Flavor Comparison
As another poster mentioned, these are lookarounds, special constructs for changing what gets matched and when. This says:
(?<=#)  match, but don't consume, the string `#`
        when followed by the next expression
[^#]+   one or more characters that are not `#`, and
(?=#)   match, but don't consume, the string `#`
        when preceded by the last expression
So this will match all the characters in between two #s.
Lookaheads and lookbehinds are very useful in many cases. Consider, for example, the rule "match all bs not followed by an a." Your first attempt might be something like b[^a], but that's not right: this will also match the bu in bus or the bo in boy, but you only wanted the b. And it won't match the b in cab, even though that's not followed by an a, because there are no more characters to match.
To do that correctly, you need a lookahead: b(?!a). This says "match a b but don't match an a afterwards, and don't make that part of the match". Thus it'll match just the b in bolo, which is what you want; likewise it'll match the b in cab.
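A short Python sketch of the two attempts, using the words from the explanation plus bat as a made-up negative case:

```python
import re

words = ["bus", "boy", "cab", "bat", "bolo"]

# b[^a] must consume one extra character, so it returns two characters
# and cannot match a "b" at the very end of the string.
naive = {w: re.findall(r"b[^a]", w) for w in words}
# b(?!a) only peeks at the next character and consumes just the "b".
fixed = {w: re.findall(r"b(?!a)", w) for w in words}

print(naive)  # {'bus': ['bu'], 'boy': ['bo'], 'cab': [], 'bat': [], 'bolo': ['bo']}
print(fixed)  # {'bus': ['b'], 'boy': ['b'], 'cab': ['b'], 'bat': [], 'bolo': ['b']}
```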
They're called look-arounds: http://www.regular-expressions.info/lookaround.html

How to negate the whole regex?

I have a regex, for example (ma|(t){1}). It matches ma and t and doesn't match bla.
I want to negate the regex, thus it must match bla and not ma and t, by adding something to this regex. I know I can write bla, the actual regex is however more complex.
Use negative lookaround: (?!pattern)
Positive lookarounds can be used to assert that a pattern matches. Negative lookarounds are the opposite: they are used to assert that a pattern DOES NOT match. Some flavors support these assertions; some put limitations on lookbehind, etc.
Links to regular-expressions.info
Lookahead and Lookbehind Zero-Width Assertions
Flavor comparison
See also
How do I convert CamelCase into human-readable names in Java?
Regex for all strings not containing a string?
A regex to match a substring that isn’t followed by a certain other substring.
More examples
These are attempts to come up with regex solutions to toy problems as exercises; they should be educational if you're trying to learn the various ways you can use lookarounds (nesting them, using them to capture, etc):
codingBat plusOut using regex
codingBat repeatEnd using regex
codingbat wordEnds using regex
Assuming you only want to disallow strings that match the regex completely (i.e., mmbla is okay, but mm isn't), this is what you want:
^(?!(?:m{2}|t)$).*$
(?!(?:m{2}|t)$) is a negative lookahead; it says "starting from the current position, the next few characters are not mm or t, followed by the end of the string." The start anchor (^) at the beginning ensures that the lookahead is applied at the beginning of the string. If that succeeds, the .* goes ahead and consumes the string.
FYI, if you're using Java's matches() method, you don't really need the ^ and the final $, but they don't do any harm. The $ inside the lookahead is required, though.
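A sketch of this whole-string negation in Python (an assumption; the answer mentions Java, and the pattern is unchanged). The sample strings are made up apart from mm and mmbla.

```python
import re

# Rejects strings that are exactly "mm" or exactly "t"; accepts the rest.
not_mm_or_t = re.compile(r"^(?!(?:m{2}|t)$).*$")

samples = ["mm", "t", "mmbla", "bla", "tt"]
accepted = [s for s in samples if not_mm_or_t.match(s)]
print(accepted)  # ['mmbla', 'bla', 'tt']
```

Dropping the $ inside the lookahead would also reject mmbla and tt, because the lookahead would then only test a prefix.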
\b(?=\w)(?!(ma|(t){1}))\b(\w*)
This is for the given regex.
The \b finds a word boundary.
The positive lookahead (?=\w) is there to avoid spaces.
The negative lookahead over the original regex prevents matches of it.
Finally, the (\w*) catches all the words that are left.
The group that will hold the words is group 3.
The simple (?!pattern) will not work, as any substring will match.
The simple ^(?!(?:m{2}|t)$).*$ will not work, as its granularity is full lines.
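A sketch of how group 3 behaves, in Python with a made-up input string. One thing it shows: as written, the negative lookahead only inspects the start of each word, so a word that merely begins with ma or t (tree below) is skipped as well.

```python
import re

pattern = re.compile(r"\b(?=\w)(?!(ma|(t){1}))\b(\w*)")

text = "ma t bla tree"  # hypothetical input
# Groups 1 and 2 sit inside the negative lookahead; group 3 is (\w*).
kept = [m.group(3) for m in pattern.finditer(text)]
print(kept)  # ['bla']
```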
This regexp matches your condition:
^.*(?<!ma|t)$
Look at how it works:
https://regex101.com/r/Ryg2FX/1
Apply this if you use Laravel.
Laravel has a not_regex rule: the field under validation must not match the given regular expression. It uses the PHP preg_match function internally.
'email' => 'not_regex:/^.+$/i'