Regex: what does (?! ...) mean? - regex

The following regex finds text between substrings FTW and ODP.
/FTW(((?!FTW|ODP).)+)ODP+/
What does the (?!...) do?

(?!regex) is a zero-width negative lookahead. It will test the characters at the current cursor position and forward, testing that they do NOT match the supplied regex, and then return the cursor back to where it started.
The whole regexp:
/
FTW # Match Characters 'FTW'
( # Start Match Group 1
( # Start Match Group 2
(?!FTW|ODP) # Ensure next characters are NOT 'FTW' or 'ODP', without matching
. # Match one character
)+ # End Match Group 2, Match One or More times
) # End Match Group 1
OD # Match characters 'OD'
P+ # Match 'P' One or More times
/
So - Hunt for FTW, then capture while looking for ODP+ to end our string. Also ensure that the data between FTW and ODP+ doesn't contain FTW or ODP

From perldoc:
A zero-width negative look-ahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind.
If you are looking for a "bar" that isn't preceded by a "foo", /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. You would have to do something like /(?!foo)...bar/ for that. We say "like" because there's the case of your "bar" not having three characters before it. You could cover that this way: /(?:(?!foo)...|^.{0,2})bar/ . Sometimes it's still easier just to say:
if (/bar/ && $` !~ /foo$/)
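The perldoc point can be checked with a short sketch. This uses Python's re module rather than Perl, on the assumption that these constructs behave the same way in both; the last pattern uses a true negative lookbehind, which is not part of the quoted text and is shown only for comparison.

```python
import re

# (?!foo) is tested at the position where "bar" starts; the next
# characters are "bar", not "foo", so the assertion succeeds.
naive = re.compile(r"(?!foo)bar")

# perldoc's workaround: three characters that are not "foo" before "bar",
# or fewer than three characters between the string start and "bar".
workaround = re.compile(r"(?:(?!foo)...|^.{0,2})bar")

# A real negative lookbehind, for comparison (not in the quoted text).
lookbehind = re.compile(r"(?<!foo)bar")

print(naive.search("foobar") is not None)      # True: the naive form still matches
print(workaround.search("foobar") is None)     # True
print(workaround.search("xxbar") is not None)  # True
print(lookbehind.search("foobar") is None)     # True
```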

It means "not followed by...". Technically this is what's called a negative lookahead: it lets you peek at what's ahead in the string without consuming it. It is a kind of zero-width assertion, meaning that such expressions don't consume any part of the string.

The programmer must have been typing too fast. Some characters in the pattern got flipped. Corrected:
/WTF(((?!WTF|ODP).)+)ODP+/

Regex
/FTW(((?!FTW|ODP).)+)ODP+/
matches the first FTW that is immediately followed by neither FTW nor ODP, then all following characters up to the first ODP (but if there is an FTW somewhere in them there will be no match), then all the letters P that follow.
So in the string:
FTWFTWODPFTWjjFTWjjODPPPPjjODPPPjjj
it will match the part in square brackets:
FTWFTWODPFTWjj[FTWjjODPPPP]jjODPPPjjj
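A quick way to confirm which part matches is to run the pattern. This is a minimal sketch in Python, assuming its re module treats this pattern the same way as the flavor in the question:

```python
import re

pattern = re.compile(r"FTW(((?!FTW|ODP).)+)ODP+")
text = "FTWFTWODPFTWjjFTWjjODPPPPjjODPPPjjj"

m = pattern.search(text)
print(m.group(0))  # FTWjjODPPPP  (whole match)
print(m.group(1))  # jj           (text between FTW and ODP)
print(m.span())    # (14, 25)
```

The first three occurrences of FTW are rejected because the text after them either begins with FTW or ODP, or runs into another FTW before an ODP is reached.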

'?!' is actually part of '(?! ... )', it means whatever is inside must NOT match at that location.

Related

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch these ~match-this~, but when I tried the negation of it, (?!(~match-this)), it literally catches all nulls.
When I tried the pattern [^~]match-this[^~], it caught only one match (the 2nd match from above). When I added an asterisk wildcard to either negated tilde, [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk wildcard on both, it caught every match-this, including those which are enclosed by tildes ~.
Is it possible to achieve this with only one regex test? Or does it need more?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char, as long as match-this does not start at that position
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or #, as long as match-this does not start at that position
) Close group 1
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
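# findall returns group 1 for each match; it is empty when the ~...~ alternative matched, so drop those
res = [m for m in re.findall(pattern, s) if m]
print(res)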
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
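The "discard what is not captured" idea can be sketched in Python as follows. The variable names are only illustrative; the input string is the one from the question.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(r"~match-this~|(\bmatch-this\b)")

# Group 1 is None whenever the ~match-this~ alternative matched,
# so keeping only captured matches skips the tilde-enclosed ones.
kept = [m.start(1) for m in pattern.finditer(s) if m.group(1) is not None]
print(kept)  # [0, 23, 35, 46, 68]
```

The five start offsets correspond to the five occurrences the question asks for.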
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead end again anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's the difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What puzzles me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing the matches, since they also come with starting positions and lengths: a first match at position 0 for 10 characters means you look at input text positions -1 and 10.
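Both points of this answer, the three-way lookaround pattern and reading the neighbouring characters from the match positions, can be sketched together. This assumes Python's re module, which supports fixed-width lookbehind; the variable names are illustrative.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(
    r"(?<=~)match-this(?!~)"    # tilde before, none after
    r"|(?<!~)match-this(?=~)"   # tilde after, none before
    r"|(?<!~)match-this(?!~)"   # no tilde on either side
)

found = []
for m in pattern.finditer(s):
    # The lookarounds consume nothing, so the neighbours are read from the input.
    before = s[m.start() - 1] if m.start() > 0 else ""
    after = s[m.end()] if m.end() < len(s) else ""
    found.append((m.start(), before, after))

print(found)
# [(0, '', '~'), (23, ' ', ' '), (35, '~', '#'), (46, '#', '~'), (68, '~', '')]
```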

Perl: Matching string not containing PATTERN

While using Perl regex to chop a string down into usable pieces I had the need to match everything except a certain pattern. I solved it after I found this hint on Perl Monks:
/^(?:(?!PATTERN).)*$/; # Matches strings not containing PATTERN
Although I solved my initial problem, I have little clue about how it actually works. I checked perlre, but it is a bit too formal to grasp.
Regular expression to match a line that doesn't contain a word? helps a lot in understanding, but why is the . in my example and the ?: and how do the outer parentheses work?
Can someone break up the regex and explain in simple words how it works?
Building it up piece by piece (and throughout assuming no newlines in the string or PATTERN):
This matches any string:
/^.*$/
But we don't want . to match a character that starts PATTERN, so replace
.
with
(?!PATTERN).
This uses a negative look-ahead that tests a given pattern without actually consuming any of the string and only succeeds if the pattern does not match at the given point in the string. So it's like saying:
if PATTERN doesn't match at this point,
match the next character
This needs to be done for every character in the string, so * is used to match zero or more times, from the beginning to the end of the string.
To make the * apply to the combination of the negative look-ahead and ., not just the ., it needs to be surrounded by parentheses, and since there's no reason to capture, they should be non-capturing parentheses (?: ):
(?:(?!PATTERN).)*
And putting back the anchors to make sure we test at every position in the string:
/^(?:(?!PATTERN).)*$/
Note that this solution is particularly useful as part of a larger match; e.g. to match any string with foo and later baz but no bar in between:
/foo(?:(?!bar).)*baz/
If there aren't such considerations, you can simply do:
/^(?!.*PATTERN)/
to check that PATTERN does not match anywhere in the string.
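As a rough illustration of these two forms, here is a Python sketch (Python rather than Perl, assuming the same semantics for these constructs), with bar standing in for PATTERN:

```python
import re

# "foo", later "baz", with no "bar" starting anywhere in between.
between = re.compile(r"foo(?:(?!bar).)*baz")
print(between.search("foo--baz") is not None)    # True
print(between.search("foo-bar-baz") is None)     # True

# Whole-string check: a single lookahead at the start is enough.
no_bar = re.compile(r"^(?!.*bar)")
print(no_bar.search("foo baz") is not None)      # True
print(no_bar.search("foo bar baz") is None)      # True
```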
About newlines: there are two problems with your regex and newlines. First, . doesn't match newlines, so "foo\nbar" =~ /^(?:(?!baz).)*$/ doesn't match, even though the string does not contain baz. You need to add the /s flag to make . match any character; "foo\nbar" =~ /^(?:(?!baz).)*$/s correctly matches. Second, $ doesn't match just at the end of the string, it also can match before a newline at the end of the string. So "foo\n" =~ /^(?:(?!\s).)*$/s does match, even though the string contains whitespace and you are attempting to only match strings with no whitespace; \z always only matches at the end, so "foo\n" =~ /^(?:(?!\s).)*\z/s correctly fails to match the string that does in fact contain a \s. So the correct general purpose regex is:
/^(?:(?!PATTERN).)*\z/s
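The newline pitfalls carry over to other engines. The sketch below uses Python, where re.DOTALL plays the role of /s and \Z is the anchor that matches only at the very end of the string (the counterpart of Perl's \z). The helper name lacks is hypothetical.

```python
import re

def lacks(pattern: str, text: str) -> bool:
    """Return True if `pattern` matches nowhere in `text` (hypothetical helper)."""
    tempered = re.compile(r"\A(?:(?!" + pattern + r").)*\Z", re.DOTALL)
    return tempered.match(text) is not None

print(lacks("baz", "foo\nbar"))   # True: DOTALL lets . cross the newline
print(lacks("baz", "foobazbar"))  # False
print(lacks(r"\s", "foo\n"))      # False: \Z refuses to stop before the newline

# With $ instead of \Z the trailing newline slips through:
print(re.match(r"^(?:(?!\s).)*$", "foo\n", re.DOTALL) is not None)  # True
```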
jippie, first, here's a tip. If you see a regex that is not immediately obvious to you, you can dump it in a tool that explains every token.
For instance, here is the RegexBuddy output:
"
^ # Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed)
(?: # Match the regular expression below
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
PATTERN # Match the character string “PATTERN” literally (case insensitive)
)
. # Match any single character that is NOT a line break character (line feed)
)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\$ # Assert position at the end of a line (at the end of the string or before a line break character) (line feed)
# Perl 5.18 allows a zero-length match at the position where the previous match ends.
# Perl 5.18 attempts the next match at the same position as the previous match if it was zero-length and may find a non-zero-length match at the same position.
"
Some people also use regex101.
A Human Explanation
Now if I had to explain the regex, I would not be so linear. I would start by saying that it is fully anchored by the ^ and the $, implying that the only possible match is the whole string, not a substring of that string.
Then we come to the meat: a non-capturing group introduced by (?: and repeated any number of times by the *
What does this group do? It contains
a negative lookahead (you may want to read up on lookarounds here) asserting that at this exact position in the string, we cannot match the word PATTERN,
then a dot to match the next character
This means that at each position in the string, we assert that we cannot match PATTERN, then we match the next character.
If PATTERN can be matched anywhere, the negative lookahead fails, and so does the entire regex.

Regex match any character NOT followed by "? something"

How can I match a path only if there is no "?" plus zero or more character on the end.
I have the following path:
/something/contentimg/coast03.jpg?itok=ABC
I want the filename, but only if there is no "?something" after the file extension.
I tried:
/^.*\/(.*)(?!\?.*)$/
But it matches anyway. This is the result. What am I doing wrong?
Array
(
[0] => /something/contentimg/coast03.jpg?itok=ABC
[1] => coast03.jpg?itok=ABC
)
Using php.
Use parse_url:
print_r(parse_url('/something/contentimg/coast03.jpg?itok=ABC'));
Array
(
[path] => /something/contentimg/coast03.jpg
[query] => itok=ABC
)
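For readers outside PHP, the same idea of letting a URL parser split off the query string can be sketched in Python. Note that this swaps in urllib.parse.urlsplit from the standard library as a stand-in for parse_url; it is not what the answer itself uses.

```python
from posixpath import basename
from urllib.parse import urlsplit

parts = urlsplit("/something/contentimg/coast03.jpg?itok=ABC")
print(parts.path)            # /something/contentimg/coast03.jpg
print(parts.query)           # itok=ABC
print(basename(parts.path))  # coast03.jpg
```

Checking whether parts.query is empty then answers the "only if there is no ?something" part of the question without any regex.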
The * quantifier is greedy, so (.*) matches everything up to the end of the input. The negative lookahead is therefore only tested at the end of the input, where nothing follows, so it trivially succeeds. The regex should be done a little differently:
/^.*\/([^?]+)$/
This expression matches one or more non-question-mark characters and then asserts that it has reached the end of the input string, which is what you want to do.
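A minimal sketch of this pattern in Python (the question uses PHP; the function name is hypothetical):

```python
import re

filename_re = re.compile(r"^.*/([^?]+)$")

def filename_without_query(path):
    """Return the file name, or None if a '?' follows it (hypothetical helper)."""
    m = filename_re.match(path)
    return m.group(1) if m else None

print(filename_without_query("/something/contentimg/coast03.jpg"))           # coast03.jpg
print(filename_without_query("/something/contentimg/coast03.jpg?itok=ABC"))  # None
```

Because [^?]+ can never step over a question mark, the $ anchor can only be reached when no query string is present.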
^.*\/([^?]+)(?![?].+)$
Your expression does not work because (.*) matches everything after the last /, so there is nothing left for the negative lookahead to inspect.
This is how it's currently matching:
.* - greedily matches up to before the last / - /something/contentimg
\/ - matches /
(.*) - matches the rest of the string - coast03.jpg?itok=ABC
(?!\?.*) - checks that the characters following don't match \?.*; since we are already at the end of the string, nothing follows, so the check always passes.
What you should do:
It seems like you can just check if a ? exists in the string, so try:
/^(?!.*\?)/
Or match up to the last /, then check for a ? from there:
/^(?!.*\/.*\?)/
Explanation:
You already know (?!...) is negative look-ahead, you're just not entirely sure how to use it. Wherever you put it, it tries its best to match the given pattern from that position onwards. If it succeeds, the regex doesn't match. So it might be a good idea to put this at the very beginning and try to match the rest of the string.
So the basic format for this example is:
/^(?!...).*$/
where (?!...) contains a pattern for the strings you want to exclude.
The .*$ at the end shouldn't be required, and if you want to check the entire string, remember the $ at the end of the look-ahead.
/^(?!...$)/
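Putting the lookahead at the start can be sketched like this in Python (the question uses PHP). The combined pattern with_name is an assumption added for illustration; it simply places the file-name capture after the lookahead.

```python
import re

plain = "/something/contentimg/coast03.jpg"
with_query = "/something/contentimg/coast03.jpg?itok=ABC"

# Reject any string that contains a "?" anywhere.
no_query = re.compile(r"^(?!.*\?)")
print(no_query.match(plain) is not None)   # True
print(no_query.match(with_query) is None)  # True

# Same check, followed by a capture of the part after the last "/".
with_name = re.compile(r"^(?!.*\?).*/(.*)$")
print(with_name.match(plain).group(1))     # coast03.jpg
print(with_name.match(with_query))         # None
```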

General approach for (equivalent of) "backreferences within character class"?

In Perl regexes, expressions like \1, \2, etc. are usually interpreted as "backreferences" to previously captured groups, but not so when the \1, \2, etc. appear within a character class. In the latter case, the \ is treated as an escape character (and therefore \1 is just 1, etc.).
Therefore, if (for example) one wanted to match a string (of length greater than 1) whose first character matches its last character, but does not appear anywhere else in the string, the following regex will not do:
/\A # match beginning of string;
(.) # match and capture first character (referred to subsequently by \1);
[^\1]* # (WRONG) match zero or more characters different from character in \1;
\1 # match \1;
\z # match the end of the string;
/sx # s: let . match newline; x: ignore whitespace, allow comments
This does not work, since it matches (for example) the string 'a1a2a':
DB<1> ( 'a1a2a' =~ /\A(.)[^\1]*\1\z/ and print "fail!" ) or print "success!"
fail!
I can usually manage to find some workaround [1], but it's always rather problem-specific, and usually far more complicated-looking than what I would do if I could use backreferences within a character class.
Is there a general (and hopefully straightforward) workaround?
[1] For example, for the problem in the example above, I'd use something like
/\A
(.) # match and capture first character (referred to subsequently
# by \1);
(?!.*\1.+\z) # a negative lookahead assertion for "a suffix containing \1";
.* # substring not containing \1 (as guaranteed by the preceding
# negative lookahead assertion);
\1\z # match last character only if it is equal to the first one
/sx
...where I've replaced the reasonably straightforward (though, alas, incorrect) subexpression [^\1]* in the earlier regex with the somewhat more forbidding negative lookahead assertion (?!.*\1.+\z). This assertion basically says "give up if \1 appears anywhere beyond this point (other than at the last position)." Incidentally, I give this solution just to illustrate the sort of workarounds I referred to in the question. I don't claim that it is a particularly good one.
This can be accomplished with a negative lookahead within a repeated group:
/\A # match beginning of string;
(.) # match and capture first character (referred to subsequently by \1);
((?!\1).)* # match zero or more characters different from character in \1;
\1 # match \1;
\z # match the end of the string;
/sx
This pattern can be used even if the group contains more than one character.
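The same technique can be tried in Python, where a backreference inside a character class is likewise not a backreference (numeric escapes inside [...] are treated as characters). This is a sketch with illustrative names, not the Perl code from the answer.

```python
import re

# Broken: inside [...] the \1 is the character with code 1, not group 1.
broken = re.compile(r"(.)[^\1]*\1", re.DOTALL)

# Working: test (?!\1) before each character of the middle part.
bookends = re.compile(r"(.)((?!\1).)*\1", re.DOTALL)

print(broken.fullmatch("a1a2a") is not None)  # True: wrongly accepted

for s in ["abca", "aa", "a1a2a", "a"]:
    print(s, bookends.fullmatch(s) is not None)
# abca True
# aa True
# a1a2a False
# a False
```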

How to match, but not capture, part of a regex?

I have a list of strings. Some of them are of the form 123-...456. The variable portion "..." may be:
the string "apple" followed by a hyphen, e.g. 123-apple-456
the string "banana" followed by a hyphen, e.g. 123-banana-456
a blank string, e.g. 123-456 (note there's only one hyphen)
Any word other than "apple" or "banana" is invalid.
For these three cases, I would like to match "apple", "banana", and "", respectively. Note that I never want to capture the hyphen, but I always want to match it. If the string is not of the form 123-...456 as described above, then there is no match at all.
How do I write a regular expression to do this? Assume I have a flavor that allows lookahead, lookbehind, lookaround, and non-capturing groups.
The key observation here is that when you have either "apple" or "banana", you must also have the trailing hyphen, but you don't want to match it. And when you're matching the blank string, you must not have the trailing hyphen. A regex that encapsulates this assertion will be the right one, I think.
The only way not to capture something is using look-around assertions:
(?<=123-)((apple|banana)(?=-456)|(?=456))
Because even with non-capturing groups (?:…) the whole regular expression captures their matched contents. But this regular expression matches only apple or banana if it’s preceded by 123- and followed by -456, or it matches the empty string if it’s preceded by 123- and followed by 456.
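A small Python sketch of this lookaround-only pattern follows; it assumes a flavor with fixed-width lookbehind, which Python's re provides. The helper name is hypothetical.

```python
import re

middle = re.compile(r"(?<=123-)((apple|banana)(?=-456)|(?=456))")

def variable_part(s):
    """Return the whole match (the variable portion), or None (hypothetical helper)."""
    m = middle.search(s)
    return m.group(0) if m else None

print(repr(variable_part("123-apple-456")))   # 'apple'
print(repr(variable_part("123-banana-456")))  # 'banana'
print(repr(variable_part("123-456")))         # ''
print(repr(variable_part("123-cherry-456")))  # None
print(repr(variable_part("123-apple456")))    # None
```

Here the overall match itself is the wanted text, since the digits and hyphens are only asserted, never consumed.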
Lookaround  Name                 What it Does
(?=foo)     Lookahead            Asserts that what immediately FOLLOWS the current position in the string is foo
(?<=foo)    Lookbehind           Asserts that what immediately PRECEDES the current position in the string is foo
(?!foo)     Negative Lookahead   Asserts that what immediately FOLLOWS the current position in the string is NOT foo
(?<!foo)    Negative Lookbehind  Asserts that what immediately PRECEDES the current position in the string is NOT foo
In javascript try: /123-(apple(?=-)|banana(?=-)|(?!-))-?456/
Remember that the result is in group 1
Based on the input provided by Germán Rodríguez Herrera
Try:
123-(?:(apple|banana|)-|)456
That will match apple, banana, or a blank string, and following it there will be 0 or 1 hyphens. I was wrong about not having a need for a capturing group. Silly me.
I have modified one of the answers (by #op1ekun):
123-(apple(?=-)|banana(?=-)|(?!-))-?456
The reason is that the answer from #op1ekun also matches "123-apple456", without the hyphen after apple.
Try this:
/\d{3}-(?:(apple|banana)-)?\d{3}/
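With this approach the hyphen is matched inside a non-capturing group and the word is read from group 1. A Python sketch, assuming the whole string must have the 123-...456 form (hence fullmatch); the helper name is hypothetical:

```python
import re

form = re.compile(r"\d{3}-(?:(apple|banana)-)?\d{3}")

def captured_word(s):
    """Return 'apple', 'banana', '' for the blank case, or None (hypothetical helper)."""
    m = form.fullmatch(s)
    if m is None:
        return None
    # Group 1 is None when the optional (?:...)? part was skipped.
    return m.group(1) or ""

print(repr(captured_word("123-apple-456")))   # 'apple'
print(repr(captured_word("123-456")))         # ''
print(repr(captured_word("123-apple456")))    # None
print(repr(captured_word("123-cherry-456")))  # None
```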
A variation of the expression by #Gumbo that makes use of \K for resetting match positions to prevent the inclusion of number blocks in the match. Usable in PCRE regex flavours.
123-\K(?:(?:apple|banana)(?=-456)|456\K)
Matches:
Match 1 apple
Match 2 banana
Match 3
By far the simplest (works for python) is '123-(apple|banana)-?456'.