Regular Expressions and negating a whole character group [duplicate] - regex

I'm attempting something which I feel should be fairly obvious to me but it's not. I'm trying to match a string which does NOT contain a specific sequence of characters. I've tried using [^ab], [^(ab)], etc. to match strings containing no 'a's or 'b's, or only 'a's or only 'b's or 'ba' but not match on 'ab'. The examples I gave won't match 'ab' it's true but they also won't match 'a' alone and I need them to. Is there some simple way to do this?

Using a character class such as [^ab] will match a single character that is not within the set of characters. (With the ^ being the negating part).
To match a string which does not contain the multi-character sequence ab, you want to use a negative lookahead:
^(?:(?!ab).)+$
And the above expression dissected in regex comment mode is:
(?x) # enable regex comment mode
^ # match start of line/string
(?: # begin non-capturing group
(?! # begin negative lookahead
ab # literal text sequence ab
) # end negative lookahead
. # any single character
) # end non-capturing group
+ # repeat previous match one or more times
$ # match end of line/string
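To see the difference between the character class and the lookahead version in practice, here is a minimal sketch assuming Python's re module. The helper name lacks_ab and the sample strings are made up for illustration, and the pattern only handles single-line input because . does not match a newline by default.

```python
import re

# Pattern from the answer: no character in the string may start the sequence "ab".
NO_AB = re.compile(r"^(?:(?!ab).)+$")

# Character-class attempt from the question: rejects any 'a' or 'b' at all.
NO_A_OR_B = re.compile(r"^[^ab]+$")


def lacks_ab(text: str) -> bool:
    """Return True if a single-line, non-empty text does not contain 'ab'."""
    return NO_AB.match(text) is not None


# Print how each sample fares under both patterns.
for sample in ["a", "b", "ba", "ab", "cab", "xyz"]:
    print(sample, lacks_ab(sample), NO_A_OR_B.match(sample) is not None)
```

The character class rejects 'a', 'b' and 'ba' even though none of them contains 'ab', which is the problem described in the question.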

Use negative lookahead:
^(?!.*ab).*$
UPDATE: In the comments below, I stated that this approach is slower than the one given in Peter's answer. I've run some tests since then, and found that it's actually slightly faster. However, the reason to prefer this technique over the other is not speed, but simplicity.
The other technique, described here as a tempered greedy token, is suitable for more complex problems, like matching delimited text where the delimiters consist of multiple characters (like HTML, as Luke commented below). For the problem described in the question, it's overkill.
For anyone who's interested, I tested with a large chunk of Lorem Ipsum text, counting the number of lines that don't contain the word "quo". These are the regexes I used:
(?m)^(?!.*\bquo\b).+$
(?m)^(?:(?!\bquo\b).)+$
Whether I search for matches in the whole text, or break it up into lines and match them individually, the anchored lookahead consistently outperforms the floating one.
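If you want to repeat that comparison yourself, a rough sketch in Python might look like the following. The sample text is a made-up stand-in for the Lorem Ipsum chunk, time_pattern is a hypothetical helper, and the timings will vary by engine and input, so treat this as a way to check that both regexes select the same lines rather than as a benchmark result.

```python
import re
import timeit

# Hypothetical sample text standing in for the Lorem Ipsum chunk.
TEXT = "\n".join([
    "lorem ipsum quo dolor",
    "sit amet",
    "quod erat",
    "status quo",
    "consectetur",
])

# Anchored lookahead: one check at the start of each line.
ANCHORED = re.compile(r"(?m)^(?!.*\bquo\b).+$")
# Tempered token: the lookahead is re-checked before every character.
TEMPERED = re.compile(r"(?m)^(?:(?!\bquo\b).)+$")

anchored_lines = ANCHORED.findall(TEXT)
tempered_lines = TEMPERED.findall(TEXT)


def time_pattern(pattern, repeats=1000):
    """Return the seconds taken to scan TEXT `repeats` times with `pattern`."""
    return timeit.timeit(lambda: pattern.findall(TEXT), number=repeats)
```

Calling time_pattern(ANCHORED) and time_pattern(TEMPERED) on a larger text gives a rough idea of the relative cost on your own machine.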

Yes, it's called negative lookahead. It goes like this: (?!regex here). So abc(?!def) will match abc not followed by def, which means it'll match the abc in abce, abc, abck, etc.
Similarly there is positive lookahead - (?=regex here). So abc(?=def) will match abc followed by def.
There are also negative and positive lookbehind: (?<!regex here) and (?<=regex here) respectively.
One point to note is that lookarounds are zero-width. That is, they check the surrounding text without consuming any characters.
So it may look like a(?=b)c will match "abc", but it won't. It will match 'a', then the positive lookahead confirms that 'b' follows without moving forward in the string. Then it tries to match 'c' against 'b', which fails. Similarly, ^a(?=b)b$ will match 'ab' and not 'abb', because the lookarounds are zero-width (in most regex implementations).
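A quick way to convince yourself of the zero-width behaviour is to run the examples. This is a small sketch assuming Python's re module; the matches helper and the results dictionary are just names chosen for the illustration.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None


results = {
    "neg_abce": matches(r"abc(?!def)", "abce"),      # abc is not followed by def
    "neg_abcdef": matches(r"abc(?!def)", "abcdef"),  # abc is followed by def
    "pos_abcdef": matches(r"abc(?=def)", "abcdef"),  # positive lookahead succeeds
    "a_b_c": matches(r"a(?=b)c", "abc"),             # lookahead consumes nothing, so c meets b
    "ab": matches(r"^a(?=b)b$", "ab"),               # b is checked, then consumed
    "abb": matches(r"^a(?=b)b$", "abb"),             # one b too many
}

# The lookahead is not part of the matched text.
matched_text = re.search(r"abc(?=def)", "abcdef").group()
```

Note that matched_text is only 'abc': the lookahead checked for 'def' but did not include it in the match.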

"abc(?!def) will match abc not followed by def. So it'll match abce, abc, abck, etc." What if I want neither def nor xyz? Will it be abc(?!(def)(xyz))?
I had the same question and found a solution:
abc(?:(?!def))(?:(?!xyz))
Each non-capturing group wraps a negative lookahead, and the two lookaheads are combined by "AND", so this should do the trick. Hope it helps.
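For what it's worth, the non-capturing groups around the lookaheads are not needed: consecutive lookaheads are all checked at the same position, and a single lookahead with an alternation does the same job. A small sketch assuming Python's re module, with made-up sample strings:

```python
import re

# Two stacked negative lookaheads: both must hold right after "abc".
STACKED = re.compile(r"abc(?!def)(?!xyz)")
# One negative lookahead with an alternation: neither branch may follow "abc".
ALTERNATION = re.compile(r"abc(?!def|xyz)")

samples = ["abcdef", "abcxyz", "abce", "abc", "abcde"]

stacked_hits = [s for s in samples if STACKED.search(s)]
alternation_hits = [s for s in samples if ALTERNATION.search(s)]
```

Both lists come out the same, so which form to use is mostly a matter of readability.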

Using a regex as you described is the simple way (as far as I am aware). If you want a range you could use [^a-f].

Simplest way is to pull the negation out of the regular expression entirely:
if (!userName.matches("^([Ss]ys)?admin$")) { ... }

Just search for "ab" in the string then negate the result:
!/ab/.test("bamboo"); // true
!/ab/.test("baobab"); // false
It seems easier and should be faster too.

In this case I might just simply avoid regular expressions altogether and go with something like:
if (StringToTest.IndexOf("ab") < 0)
//do stuff
This is likely also going to be much faster (a quick test vs regexes above showed this method to take about 25% of the time of the regex method). In general, if I know the exact string I'm looking for, I've found regexes are overkill. Since you know you don't want "ab", it's a simple matter to test if the string contains that string, without using regex.
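The same two ideas (search for "ab" and negate, or skip regex altogether) look like this in Python. This is only a sketch: the function names are made up, and no timing claim is implied for Python.

```python
import re


def lacks_ab_regex(text: str) -> bool:
    """Search for the unwanted sequence with a regex and negate the result."""
    return re.search("ab", text) is None


def lacks_ab_plain(text: str) -> bool:
    """Plain substring test, no regex involved."""
    return "ab" not in text


checks = {word: (lacks_ab_regex(word), lacks_ab_plain(word))
          for word in ["bamboo", "baobab"]}
```

Here checks["bamboo"] is (True, True) and checks["baobab"] is (False, False), matching the JavaScript example above.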

The regex [^ab] will match, for example, 'ab ab ab ab' but not 'ab', because it matches the space character between the words.
What language/scenario do you have? Can you subtract results from the original set, and just match ab?
If you are using GNU grep to filter input, use the '-v' flag to invert your results, returning all non-matches. Other regex tools have a similar 'return nonmatch' function.
If I understand correctly, you want everything except for those items which contain 'ab' anywhere.

Related

Regex, avoid matching consecutive characters

I'm trying to improve my regex skills.
I can't manage this exercise.
https://alf.nu/RegexGolf
You have to match words without consecutive identical characters.
To make it clear, we should avoid patterns like abba, or baab, czzc.
The only way I see is to use capture groups:
([a-z])([a-z])\2\1
Then have a negative lookahead:
(?!([a-z])([a-z])\2\1)
But on the site it doesn't work since it doesn't match anything.
Any advice?
Thank you
Use a negative lookahead:
^(?:(.)(?!\1))*$
Explanation:
^ from the start of the input
(?:
(.) match AND capture a single character
(?!\1) then assert that what follows is a different character (not the same)
)* match zero or more such matching characters
$ end of the input
Another, possibly cleaner, way to do this would be to just have a global negative lookahead at the very start of the pattern:
^(?!.*(.)\1).*$
This asserts at the very beginning that no character is immediately followed by a copy of itself, anywhere in the string.
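Both patterns can be checked side by side. The following is a sketch assuming Python's re module; the function names and test words are made up for the example.

```python
import re

# Step-by-step version: each character must not be followed by itself.
STEPWISE = re.compile(r"^(?:(.)(?!\1))*$")
# Global version: one lookahead at the start scans for any doubled character.
GLOBAL = re.compile(r"^(?!.*(.)\1).*$")


def no_adjacent_repeat_stepwise(word: str) -> bool:
    return STEPWISE.match(word) is not None


def no_adjacent_repeat_global(word: str) -> bool:
    return GLOBAL.match(word) is not None


words = ["abc", "abab", "abba", "aab", ""]
stepwise_ok = [w for w in words if no_adjacent_repeat_stepwise(w)]
global_ok = [w for w in words if no_adjacent_repeat_global(w)]
```

Both lists contain 'abc', 'abab' and the empty string, while 'abba' and 'aab' are rejected because of the doubled letter.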
^(?!cr|pal|tar)[a-z]{1,4}([a-z])\1[a-z]{0,5}$
This worked for me in the link you gave. I guess we had to match patterns with consecutive letters. But there were some exceptions for which I had to use negative look ahead at the beginning. I have used ([a-z])\1 to match consecutive characters surrounded by possible characters of possible limit. Hope this helps!
Attached the screenshot for reference.
https://i.stack.imgur.com/va1Uq.png
Thanks to Tim Biegeleisen, here is the answer.
^(?!.*(.)(.)\2\1).*$
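A short check of that final pattern, sketched with Python's re module and a made-up word list:

```python
import re

# Reject any word containing a four-character "xyyx" sequence.
NO_ABBA = re.compile(r"^(?!.*(.)(.)\2\1).*$")

words = ["abba", "baab", "czzc", "abab", "bamboo", "noon"]
accepted = [w for w in words if NO_ABBA.match(w)]
```

Only 'abab' and 'bamboo' are accepted. Note that the two groups may capture the same character, so a word like 'aaaa' is rejected as well.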

Regular expression in Groovy not returning expected results

I've been working on a regular expression with the following requirements.
// Must be exactly 17 characters
// Must only contain letters and numbers
// Cannot contain the letters ‘I’, ‘O’ or ‘Q’
// Must contain at least 1 alpha and 1 numeric character.
Thanks to some help in another topic I managed to get a regular expression of
/^(?=.*[0-9])(?=.*[a-zA-Z])([a-hj-npr-z0-9]{17})$/
I was able to validate this as per https://regex101.com/r/cVz4b9/4/.
For some reason when I try this in Groovy though I don't get the same results.
def regex = /^(?=.*[0-9])(?=.*[a-zA-Z])([a-hj-npr-z0-9]{17})$/
println('B1cCdDeEfFgGhHwww' ==~ regex)
For example, the Groovy script above prints false when I'm expecting true. Perhaps I'm not escaping something I should be? I am using a slashy string, so I'm not sure why this would not work.
If anyone can pick out what's wrong that would help me a lot.
thanks
Since \w matches [a-zA-Z_0-9], you can take the following ordered (and concise) approach:
Start with the case-insensitivity flag: (?i). Since it is not revoked, it "works" till the end of the regex.
Put in both positive lookaheads, for a single digit and a single letter placed anywhere: (?=.*[\d])(?=.*[a-z]).
Put in a negative lookahead for the 3 "forbidden" chars, but you must also forbid "_", which is matched by \w (see below): (?!.*[ioq_]).
Put in the main clause for 17 word chars: [\w]{17} (instead of mentioning letters and digits separately; remember that "_" was forbidden earlier).
^ and $ are not needed, since ==~ checks whether the entire text is matched by the regex.
To sum up, the regex can be: (?i)(?=.*[\d])(?=.*[a-z])(?!.*[ioq_])[\w]{17}.
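Your character class [a-hj-npr-z0-9] only lists lowercase letters, so the uppercase letters in your test string can't match. Judging by your example, case doesn't matter, so you could just add the case-insensitivity flag (?i):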
def regex = /^(?=.*[0-9])(?=.*[a-zA-Z])((?i)[a-hj-npr-z0-9]{17})$/
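The effect of the missing uppercase letters is easy to reproduce outside Groovy. Here is a sketch in Python, assuming that re.fullmatch is a fair stand-in for Groovy's ==~ (both require the whole string to match); the names STRICT, RELAXED and is_valid are made up for the example.

```python
import re

BODY = r"(?=.*[0-9])(?=.*[a-zA-Z])[a-hj-npr-z0-9]{17}"

# As written in the question: the final class only allows lowercase letters.
STRICT = re.compile(BODY, re.ASCII)
# Same pattern, matched case-insensitively as the answers suggest.
RELAXED = re.compile(BODY, re.ASCII | re.IGNORECASE)


def is_valid(code: str, pattern=RELAXED) -> bool:
    """Return True if the whole string satisfies the pattern."""
    return pattern.fullmatch(code) is not None


sample = "B1cCdDeEfFgGhHwww"
strict_result = is_valid(sample, STRICT)
relaxed_result = is_valid(sample)
```

With the strict pattern the sample is rejected because of its uppercase letters; with the case-insensitive pattern it is accepted, while strings containing I, O or Q in either case are still rejected.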

Extend an regex with logical AND in a non-capturing group

I want to extend an existing regex string:
((?:street)|(?:addr)|(?:straße)|(?:strasse)|(?:adr))
It basically matches strings like street or address.
Now I want to add a condition: if the string 'addressAdd' or 'streetnr' is present, nothing should match anymore (not even street).
I tried
((?:street)|(?:addr)|(?:straße)|(?:strasse)|(?:adr))(^(?:addressAdd))(^(?:streetnr))
and several variations thereof, but didn't succeed. Does anyone know how to negate strings?
Update: Some clarification: if a string like addressAdd exists, I don't want any string to match. The Java code for this would look like this:
String toCheck = "some string to match";
if ((!toCheck.equals("streetnr") && !toCheck.equals("addressAdd")) && (toCheck.equals("street") || toCheck.equals("strasse") || toCheck.equals("adr"))) {
    // handle the matching string here
}
I'd rather remove unnecessary grouping constructs and add a negative lookahead with these 2 exceptions:
(?!addressAdd|streetnr)(?:street|addr|straße|strasse|adr)
To match whole words:
\b(?!(?:addressAdd|streetnr)\b)(?:street|addr|straße|strasse|adr)\b
In short: (?!addressAdd|streetnr) checks that neither addressAdd nor streetnr starts at the current position, and only then can the regex engine go on to match one of the alternatives listed in the (?:street|addr|straße|strasse|adr) non-capturing group. With word boundaries (\b(?!(?:addressAdd|streetnr)\b)), the lookahead only rejects those exceptions when they are whole words (so a longer word such as streetnrs is not rejected by the lookahead).
Answer to the update:
To match strings (or lines if DOTALL option is not used) that contain specific substrings and do not contain disallowed whole words, use the negative lookahead at the beginning of the pattern right after ^:
^(?!.*\b(?:addressAdd|streetnr)\b).*(?:street|addr|straße|strasse|adr).*
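To check how that last pattern behaves on a few inputs, here is a sketch assuming Python's re module. The helper name is_address_field and the sample strings are made up for illustration.

```python
import re

# Reject the whole string if a forbidden whole word appears anywhere,
# otherwise require one of the address-like substrings.
FIELD = re.compile(
    r"^(?!.*\b(?:addressAdd|streetnr)\b).*(?:street|addr|straße|strasse|adr).*"
)


def is_address_field(name: str) -> bool:
    return FIELD.match(name) is not None


samples = ["street", "home address", "streetnr", "street streetnr",
           "addressAdd", "streetnrs", "city"]
accepted = [s for s in samples if is_address_field(s)]
```

The accepted list is 'street', 'home address' and 'streetnrs'. The last one passes because streetnr is only forbidden as a whole word.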

Regex negation?

I'm playing Regex Golf (http://regex.alf.nu/) and I'm doing the Abba hole. I have the following regex that matches the wrong side entirely (which is what I was trying to do):
(([\w])([\w])\3\2)
However, I'm trying to negate it now so it matches the other side. I can't seem to figure that part out. I tried:
(?!([\w])([\w])\3\2)
But that didn't work. Any tips from the regex masters?
You can make it much shorter (and get more points) by simply using . and removing unnecessary parens:
^(?!.*(.)(.)\2\1)
It just makes sure that there's no "abba" ("abba" here means 4 letters in that particular order we don't want to match) in any part of the string without having to match the whole word.
Using the explanation here: https://stackoverflow.com/a/406408/584663
I came up with: ^((?!((\w)(\w)\4\3)).)*$
The key here turns out to be the leading caret, ^, and the .*
(?! ...) is a look-ahead construct, and so does not advance the regex processing engine.
/(?! ...)/ on its own will correctly return a negative result for items matching the expression within; but for items which do not match (...) the regex engine continues processing. However if your regex only contains the (?! ) there is nothing left to process, and the regex processing position never advances. (See this great answer).
Apparently since the remaining regex is empty, it matches any zero-width segment of a string, i.e. it matches any string.
[begin SWAG]
With the caret ^ present, the regex engine is able to recognize that you are looking for a real answer and that you do not want it to tell you the string contains zero-width components.
[end SWAG]
Thus it is able to correctly fail to match when the (?! ) succeeds.
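The difference between the bare and the anchored lookahead can be observed directly. This is a sketch assuming Python's re module and a simplified version of the question's pattern (without the outer group); the variable names are made up.

```python
import re

# Bare lookahead: the engine may try every position in the string.
BARE = re.compile(r"(?!(\w)(\w)\2\1)")
# Anchored lookahead: only the start of the string is tried,
# and .* lets the lookahead scan the rest of the string from there.
ANCHORED = re.compile(r"^(?!.*(.)(.)\2\1)")

bare_match = BARE.search("abba")
anchored_abba = ANCHORED.search("abba")
anchored_abab = ANCHORED.search("abab")
```

The bare version fails at position 0 but then succeeds with an empty match at position 1, where the remaining text 'bba' is no longer an abba sequence. The anchored version has no other position to fall back on, so it returns no match for 'abba' and an empty match at position 0 for 'abab'.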

Regex to match all permutations of {1,2,3,4} without repetition

I am trying to solve the following problem in Ruby.
Here's the pattern that I want :
1234, 1324, 1432, 1423, 2341 and so on
i.e. the digits in the four digit number should be between [1-4] and should also be non-repetitive.
To put it more simply, take a two-digit pattern, for which the solution should be:
12, 21
i.e. the digits should be either 1 or 2 and should be non-repetitive.
To make sure that they are non-repetitive I want to use $1 for the condition for my second digit, but it's not working.
Please help me out and thanks in advance.
You can use this (see on rubular.com):
^(?=[1-4]{4}$)(?!.*(.).*\1).*$
The first assertion ensures that it's ^[1-4]{4}$, the second assertion is a negative lookahead that ensures that you can't match .*(.).*\1, i.e. a repeated character. The first assertion is "cheaper", so you want to do that first.
References
regular-expressions.info/Lookarounds and Backreferences
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
Just for a giggle, here's another option:
^(?:1()|2()|3()|4()){4}\1\2\3\4$
As each unique character is consumed, the capturing group following it captures an empty string. The backreferences also try to match empty strings, so if one of them doesn't succeed, it can only mean the associated group didn't participate in the match. And that will only happen if the string contains at least one duplicate.
This behavior of empty capturing groups and backreferences is not officially supported in any regex flavor, so caveat emptor. But it works in most of them, including Ruby.
I think this solution is a bit simpler
^(?:([1-4])(?!.*\1)){4}$
See it here on Rubular
^ # matches the start of the string
(?: # open a non-capturing group
([1-4]) # one of the allowed characters; the character found is captured in group 1
(?!.*\1) # that character is accepted only if it does not occur again later in the string
){4} # repeat the group exactly 4 times
$ # matches the end of the string
(?!.*\1) is a lookahead assertion, to ensure the character is not repeated.
^ and $ are anchors to match the start and the end of the string.
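Both lookahead-based patterns can be verified exhaustively, since there are only 10,000 four-digit strings. The following sketch assumes Python's re module and compares the matches against itertools.permutations; the variable names are made up for the example.

```python
import re
from itertools import permutations, product

# Two assertions up front, then consume the string.
TWO_ASSERTIONS = re.compile(r"^(?=[1-4]{4}$)(?!.*(.).*\1).*$")
# One check per character: it must not occur again later.
PER_CHARACTER = re.compile(r"^(?:([1-4])(?!.*\1)){4}$")

# The 24 strings we expect to match.
expected = {"".join(p) for p in permutations("1234")}

# Every four-character string over the digits 0-9.
candidates = ["".join(d) for d in product("0123456789", repeat=4)]

found_two_assertions = {c for c in candidates if TWO_ASSERTIONS.match(c)}
found_per_character = {c for c in candidates if PER_CHARACTER.match(c)}
```

Both sets come out equal to expected, so the two patterns accept exactly the permutations of 1234.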
While the previous answers solve the problem, they aren't as generic as they could be, and don't allow for repetitions in the initial set of symbols, for example {a,a,b,b,c,c}. After asking a similar question on Perl Monks, the following solution was given by Eily:
^(?:(?!\1)a()|(?!\2)a()|(?!\3)b()|(?!\4)b()|(?!\5)c()|(?!\6)c()){6}$
Similarly, this works for longer "symbols" in a string, and for variable length symbols too.
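As far as I know, this pattern relies on backreferences to groups that are defined later in the pattern, which not every engine accepts (Python's re module, for one, rejects such references when compiling). Where that is the case, a non-regex fallback is to compare character counts. The sketch below is that swapped-in technique, not the regex from the answer, and it only covers single-character symbols; the function name and default argument are made up.

```python
from collections import Counter


def is_arrangement_of(text: str, symbols: str = "aabbcc") -> bool:
    """Return True if text uses exactly the characters of symbols,
    each the same number of times, in any order."""
    return Counter(text) == Counter(symbols)


samples = ["abcabc", "aabbcc", "aaabcc", "abc"]
accepted = [s for s in samples if is_arrangement_of(s)]
```

Here 'abcabc' and 'aabbcc' are accepted, while 'aaabcc' (three a's) and 'abc' (too short) are not.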