How negative lookahead works

I want to match a string that does not contain the word "the".
The following solution looks logical to me:
^(?!.*the.*).*$
The following one (which I came across on SO) also works, but I cannot understand WHY it works:
^((?!the).)*$
In my view, (?!the). should match a) ANY b) single character, which is then repeated by *, so shouldn't the regex match any string?
I use the great site http://www.rexegg.com for reference, but there is no such example there.

It basically matches any character, one at a time, and before each one it checks whether the string literal "the" starts at that position. If it does, the negative lookahead fails and cancels the match.
^ # Assert position at the beginning of a line (at beginning of the string or after a line break character)
( # Match the regular expression below and capture its match into backreference number 1
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
the # Match the characters “the” literally
)
. # Match any single character that is not a line break character
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ # Assert position at the end of a line (at the end of the string or before a line break character)
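To check the breakdown above, here is a minimal sketch using Python's re module. The choice of Python and the sample strings are assumptions made for illustration; they are not part of the original question. It runs both patterns from the question over the same inputs.

```python
import re

# The two patterns from the question.
LOOKAHEAD_ONCE = re.compile(r"^(?!.*the.*).*$")
TEMPERED_DOT = re.compile(r"^((?!the).)*$")

# Hypothetical sample strings chosen for illustration.
samples = ["abc", "abcthe", "the end", "I was going there", ""]

def agree(text):
    """Return (once, tempered): whether each pattern matches text."""
    return (LOOKAHEAD_ONCE.search(text) is not None,
            TEMPERED_DOT.search(text) is not None)

results = {s: agree(s) for s in samples}
for s, (once, tempered) in results.items():
    print(f"{s!r}: once={once} tempered={tempered}")
```

On these samples the two patterns should give the same answer: both accept the strings without "the" (including the empty string) and both reject the rest.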

The above solution works, but it also rejects strings in which the characters the appear inside another word -- e.g., I was going there would be excluded. You need word boundaries if you want to match everything not containing the word the:
^((?!\bthe\b).)*$
or:
^(?!.*\bthe\b).*$
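A small Python sketch of the difference the word boundaries make. The sample sentences and the helper name matches are assumptions introduced only for this illustration.

```python
import re

PLAIN = re.compile(r"^((?!the).)*$")            # rejects "the" anywhere
WHOLE_WORD = re.compile(r"^((?!\bthe\b).)*$")   # rejects only the word "the"

def matches(pattern, text):
    """True if the compiled pattern matches text."""
    return pattern.search(text) is not None

sentence = "I was going there"   # "the" occurs only inside "there"
with_word = "I saw the cat"      # contains the word "the"

print(matches(PLAIN, sentence))        # False: "the" inside "there" is enough
print(matches(WHOLE_WORD, sentence))   # True: no standalone "the"
print(matches(WHOLE_WORD, with_word))  # False: standalone "the" present
```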

^((?!the).)*$
This checks at every position, before consuming a character, whether the is ahead of it. So in the string abcthe, after c the regex engine sees the and the lookahead fails. Because of the ^ and $ anchors, the engine cannot make a complete match, so the regex fails and matches nothing. If you remove $, it will match up to abc.
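The effect of the $ anchor can be seen with a short Python sketch (Python is an assumption here; the string abcthe is the one used above):

```python
import re

text = "abcthe"

# With both anchors the whole string must pass the lookahead at every position.
anchored = re.search(r"^((?!the).)*$", text)

# Without $ the match simply stops before the first "the".
unanchored = re.search(r"^((?!the).)*", text)

print(anchored)             # None
print(unanchored.group(0))  # abc
```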

Related

Removing last character from a line using regex

I just started learning regex and I'm trying to understand how it is possible to do the following:
If I have:
helmut_rankl:20Suzuki12
helmut1195:wasserfall1974
helmut1951:roller11
Get:
helmut_rankl:20Suzuki1
helmut1195:wasserfall197
helmut1951:roller1
I tried using .$, which should match the last character of a string, but it doesn't seem to match letters and numbers.
How do I get these results from the input?

You could match the whole line, and assert a single char to the right if you want to match at least a single character.
.+(?=.)
If you also want to match empty strings:
.*(?=.)
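A minimal Python sketch of the lookahead approach, using the three lines from the question. The helper name drop_last is hypothetical.

```python
import re

lines = [
    "helmut_rankl:20Suzuki12",
    "helmut1195:wasserfall1974",
    "helmut1951:roller11",
]

def drop_last(line):
    """Return the text matched by .+(?=.), or None when nothing matches."""
    m = re.match(r".+(?=.)", line)
    return m.group(0) if m else None

trimmed = [drop_last(line) for line in lines]
print(trimmed)
```

Note that .+(?=.) needs at least two characters to match, so drop_last returns None for a one-character line, while .*(?=.) would match the empty string in that case.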

This will do what you want with regex's match function.
^(.*).$
Broken down:
^ matches the start of the string
( and ) denote a capturing group. The matches which fall within it are returned.
.* matches everything, as much as it can.
The final . matches any single character (i.e. the last character of the line)
$ matches the end of the line/input
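As a sketch of how the capturing group is used, here is the same pattern with Python's re.match (the language and the helper name drop_last_char are assumptions for illustration):

```python
import re

def drop_last_char(line):
    """Return line without its last character, using the capture in ^(.*).$"""
    m = re.match(r"^(.*).$", line)
    # Group 1 holds everything except the final character; an empty
    # line has no final character, so it is returned unchanged.
    return m.group(1) if m else line

print(drop_last_char("helmut1951:roller11"))  # helmut1951:roller1
```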

regex for a whole word containing dots within a sentence

I am looking for a regular expression to catch a whole word or expression within a sentence that contains dots:
this is an example test.abc.123 for what I am looking for
In this case I want to catch "test.abc.123"
I tried with this regex:
(.*)(\b.+\..++\b)(.*)
(.*) some characters or none
(\b.+\..++\b) a word containing some characters followed by at least one dot that is followed by some characters, and this at least once
(.*) some more characters or none
but it gets me: "abc.123 for what I am looking for"
I see that I got something completely wrong, can anyone enlighten me?

If you need to match part of a string, you don't need to match the entire string (unless you are restricted by some functionality).
Your regex is too greedy. It also has dots everywhere (.+ is not a good choice most of the time). It doesn't have a precise point to start and finish either. You only need:
\w+(?:\.+\w+)+
It looks for strings that begin and end with word characters and contain at least one period.
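A quick check of this pattern against the sentence from the question, sketched in Python (an assumed choice of language):

```python
import re

sentence = "this is an example test.abc.123 for what I am looking for"

# The group is non-capturing, so findall returns the full matches.
found = re.findall(r"\w+(?:\.+\w+)+", sentence)
print(found)  # ['test.abc.123']
```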

This regex pattern matches strings with two or more dots:
.*\..*\..*
"." matches any character except line breaks
"*" repeats the previous token 0 or more times
"\." matches a single literal dot; the backslash is used for escaping
.* Matches any characters and continues matching until the next token
test.abc.123
\. Matches a single dot
test. abc.123
.* Again, matches any characters and continues matching until the next token
test.example.com
\. Matches a single dot
test.example. com
.* Matches any characters and continues matching until the next token
test.example.com
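A short Python sketch (language assumed) of what this pattern does in practice. It tests whether a string contains two or more dots, but because of the leading and trailing .* it matches the whole line, not just the dotted word.

```python
import re

TWO_DOTS = re.compile(r".*\..*\..*")

sentence = "this is an example test.abc.123 for what I am looking for"

two = TWO_DOTS.fullmatch("test.abc.123")   # two dots: matches
one = TWO_DOTS.fullmatch("test.abc")       # one dot: no match
whole = TWO_DOTS.search(sentence).group(0) # the entire sentence

print(two is not None, one is not None)
print(whole == sentence)
```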

Try this pattern: (?=\w+\.{1,})[^ ]+.
Details: (?=\w+\.{1,}) is a positive lookahead that locates the start of a word containing at least one dot (.). Matching then starts from that position and continues up to the next space with [^ ]+.
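A minimal Python sketch of this pattern on the sentence from the question. It reads the trailing period after [^ ]+ in the answer as sentence punctuation, not as part of the pattern; that reading is an assumption.

```python
import re

sentence = "this is an example test.abc.123 for what I am looking for"

# The lookahead finds a position where word characters are followed by a dot;
# [^ ]+ then consumes everything up to the next space.
m = re.search(r"(?=\w+\.{1,})[^ ]+", sentence)
word = m.group(0) if m else None
print(word)  # test.abc.123
```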

Perl: Matching string not containing PATTERN

While using Perl regex to chop a string down into usable pieces I had the need to match everything except a certain pattern. I solved it after I found this hint on Perl Monks:
/^(?:(?!PATTERN).)*$/; # Matches strings not containing PATTERN
Although I solved my initial problem, I have little clue about how it actually works. I checked perlre, but it is a bit too formal to grasp.
The question "Regular expression to match a line that doesn't contain a word?" helps a lot in understanding, but why are the . and the ?: in my example, and how do the outer parentheses work?
Can someone break up the regex and explain in simple words how it works?

Building it up piece by piece (and throughout assuming no newlines in the string or PATTERN):
This matches any string:
/^.*$/
But we don't want . to match a character that starts PATTERN, so replace
.
with
(?!PATTERN).
This uses a negative look-ahead that tests a given pattern without actually consuming any of the string and only succeeds if the pattern does not match at the given point in the string. So it's like saying:
if PATTERN doesn't match at this point,
match the next character
This needs to be done for every character in the string, so * is used to match zero or more times, from the beginning to the end of the string.
To make the * apply to the combination of the negative look-ahead and ., not just the ., it needs to be surrounded by parentheses, and since there's no reason to capture, they should be non-capturing parentheses (?: ):
(?:(?!PATTERN).)*
And putting back the anchors to make sure we test at every position in the string:
/^(?:(?!PATTERN).)*$/
Note that this solution is particularly useful as part of a larger match; e.g. to match any string with foo and later baz but no bar in between:
/foo(?:(?!bar).)*baz/
If there aren't such considerations, you can simply do:
/^(?!.*PATTERN)/
to check that PATTERN does not match anywhere in the string.
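Both forms can be tried out with a short Python sketch. Python is an assumption here (the answer is about Perl), and the sample strings are made up for illustration; the patterns themselves are unchanged.

```python
import re

BETWEEN = re.compile(r"foo(?:(?!bar).)*baz")   # foo ... baz with no bar in between
NOT_ANYWHERE = re.compile(r"^(?!.*bar)")       # bar occurs nowhere in the string

clean = BETWEEN.search("foo 123 baz")
blocked = BETWEEN.search("foo bar baz")
later = BETWEEN.search("foo bar foo baz")      # second foo has no bar before baz

print(clean.group(0))    # foo 123 baz
print(blocked)           # None
print(later.group(0))    # foo baz

print(NOT_ANYWHERE.search("foo baz") is not None)  # True
print(NOT_ANYWHERE.search("foo bar") is not None)  # False
```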
About newlines: there are two problems with your regex and newlines.
First, . doesn't match newlines, so "foo\nbar" =~ /^(?:(?!baz).)*$/ doesn't match, even though the string does not contain baz. You need to add the /s flag to make . match any character; "foo\nbar" =~ /^(?:(?!baz).)*$/s correctly matches.
Second, $ doesn't match just at the end of the string; it can also match before a newline at the end of the string. So "foo\n" =~ /^(?:(?!\s).)*$/s does match, even though the string contains whitespace and you are attempting to match only strings with no whitespace. \z always matches only at the end, so "foo\n" =~ /^(?:(?!\s).)*\z/s correctly fails to match the string that does in fact contain a \s.
So the correct general-purpose regex is:
/^(?:(?!PATTERN).)*\z/s
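The same two pitfalls can be reproduced outside Perl. The sketch below uses Python, which is an assumption: it takes re.DOTALL as the counterpart of /s, and Python's \Z (end of string only) as the counterpart of Perl's \z. Note that Python's \Z is not the same as Perl's \Z.

```python
import re

no_baz = r"^(?:(?!baz).)*$"

# Pitfall 1: without DOTALL the dot stops at the newline.
plain = re.search(no_baz, "foo\nbar")
dotall = re.search(no_baz, "foo\nbar", re.DOTALL)
print(plain is None, dotall is not None)   # True True

# Pitfall 2: $ also matches just before a trailing newline.
with_dollar = re.search(r"^(?:(?!\s).)*$", "foo\n", re.DOTALL)
with_end = re.search(r"^(?:(?!\s).)*\Z", "foo\n", re.DOTALL)
print(with_dollar is not None)  # True: matches "foo" before the newline
print(with_end is None)         # True: the newline is whitespace, so no match
```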

jippie, first, here's a tip. If you see a regex that is not immediately obvious to you, you can dump it in a tool that explains every token.
For instance, here is the RegexBuddy output:
"
^ # Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed)
(?: # Match the regular expression below
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
PATTERN # Match the character string “PATTERN” literally (case insensitive)
)
. # Match any single character that is NOT a line break character (line feed)
)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\$ # Assert position at the end of a line (at the end of the string or before a line break character) (line feed)
# Perl 5.18 allows a zero-length match at the position where the previous match ends.
# Perl 5.18 attempts the next match at the same position as the previous match if it was zero-length and may find a non-zero-length match at the same position.
"
Some people also use regex101.
A Human Explanation
Now if I had to explain the regex, I would not be so linear. I would start by saying that it is fully anchored by the ^ and the $, implying that the only possible match is the whole string, not a substring of that string.
Then we come to the meat: a non-capturing group introduced by (?: and repeated any number of times by the *
What does this group do? It contains
a negative lookahead asserting that at this exact position in the string, we cannot match the word PATTERN,
then a dot to match the next character
This means that at each position in the string, we assert that we cannot match PATTERN, then we match the next character.
If PATTERN can be matched anywhere, the negative lookahead fails, and so does the entire regex.
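The position-by-position reading above can be written out as a plain loop. This is only a sketch in Python under stated assumptions: PATTERN is treated as a literal string, the text contains no newlines, and the function names are hypothetical. It is not how a regex engine is implemented, just the same logic spelled out.

```python
import re

def tempered_match(text, pattern):
    """Mimic ^(?:(?!PATTERN).)*$ for a literal pattern, one position at a time."""
    for i in range(len(text)):
        # Negative lookahead: does the literal pattern start at position i?
        if text.startswith(pattern, i):
            return False  # the lookahead fails here, so the end is never reached
        # Otherwise the dot consumes text[i] and the loop moves on.
    return True

def regex_match(text, pattern):
    """The same test done with the actual regex."""
    regex = r"^(?:(?!" + re.escape(pattern) + r").)*$"
    return re.search(regex, text) is not None

samples = ["abc", "abcPATTERNdef", "PATTERN", ""]
for s in samples:
    print(repr(s), tempered_match(s, "PATTERN"), regex_match(s, "PATTERN"))
```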

regex to match a word with unique (non-repeating) characters

I'm looking for a regex that will match a word only if all its characters are unique, meaning, every character in the word appears only once.
Example:
abcdefg -> will return MATCH
abcdefgbh -> will return NO MATCH (because the letter b repeats more than once)

Try this, it might work,
^(?:([A-Za-z])(?!.*\1))*$
Explanation
Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match the regular expression below «(?:([A-Za-z])(?!.*\1))*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the regular expression below and capture its match into backreference number 1 «([A-Za-z])»
Match a single character in the range between “A” and “Z” or between “a” and “z” «[A-Za-z]»
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*\1)»
Match any single character that is not a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the same text as most recently matched by capturing group number 1 «\1»
Assert position at the end of a line (at the end of the string or before a line break character) «$»
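A quick Python sketch of this pattern on the two words from the question (Python and the helper name all_unique are assumptions for illustration):

```python
import re

UNIQUE = re.compile(r"^(?:([A-Za-z])(?!.*\1))*$")

def all_unique(word):
    """True if every letter in word occurs only once (case-sensitive)."""
    return UNIQUE.match(word) is not None

print(all_unique("abcdefg"))    # True
print(all_unique("abcdefgbh"))  # False: b occurs twice
print(all_unique(""))           # True: zero repetitions of the group
```

Because the group is repeated with *, the empty string also matches; use + in place of * if at least one letter is required.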

You can check whether there are 2 instances of the character in the string:
^.*(.).*\1.*$
(I simply capture one of the characters and check with a backreference whether it has a copy elsewhere. The remaining .* are don't-cares.)
If the regex above matches, then the string has a repeating character. If it doesn't match, then all the characters are unique.
The advantage of the regex above is that it works even when the regex engine doesn't support lookaround.
John Woo's solution is a beautiful way to check for uniqueness directly. It asserts at every character that the rest of the string does not contain the current character.
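The inverted check can be sketched in Python like this (the language and the helper name has_repeat are assumptions):

```python
import re

HAS_REPEAT = re.compile(r"^.*(.).*\1.*$")

def has_repeat(word):
    """True if some character occurs at least twice in word."""
    return HAS_REPEAT.search(word) is not None

print(has_repeat("abcdefgbh"))  # True: b repeats
print(has_repeat("abcdefg"))    # False: all characters are unique
```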

This one would also provide a full match for a word of any length with non-repeating letters:
^(?!.*(.).*\1)[a-z]+$
I slightly revised the answer provided by @Bohemian to another question a while ago to get this.
It has been a while since the question above was asked, but I thought it would be nice to have this regex pattern here as well.
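A short Python sketch of this variant (Python is an assumed choice; the test words come from the question):

```python
import re

UNIQUE_LOWER = re.compile(r"^(?!.*(.).*\1)[a-z]+$")

def is_unique_word(word):
    """True for a non-empty lowercase word with no repeated letter."""
    return UNIQUE_LOWER.match(word) is not None

print(is_unique_word("abcdefg"))    # True
print(is_unique_word("abcdefgbh"))  # False
print(is_unique_word(""))           # False: [a-z]+ needs at least one letter
```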

Regular expression doesn't match if a character participated in a previous match

I have this regex:
(?:\S)\++(?:\S)
Which is supposed to catch all the pluses in a query string like this:
?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or
It should have been 4 matches, but there are only 3:
s+n
e+c
s+e
It is missing the last one:
e+S
And it seems to happen because the "e" character has participated in a previous match (s+e), because the "e" character is right in the middle of two pluses (Teni s+e+S quash).
If you test the regex with the following input, it matches the last "+":
?busca=tenis+nike+categoria:"Tenis_e+Squash"&pagina=4&operador=or
(changed "s+e" for "s_e" in order not to cause the "e" character to participate in the match).
Would someone please shed some light on that?
Thanks in advance!

In a consecutive match, the search for the next match starts at the position where the previous match ended. And since the non-whitespace character after the + is matched too, the search for the next match will start after that non-whitespace character. So in a sequence like s+e+S you will only find one match:
s+e+S
\_/
You can fix that by using look-around assertions, which don’t consume the characters they assert, like:
\S\++(?=\S)
This will match any non-whitespace character followed by one or more + only if it is followed by another non-whitespace character.
But since whitespace is not allowed in a URI query, you don’t need the surrounding \S at all, as every character is non-whitespace. So the following will already match every sequence of one or more + characters:
\++
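The three patterns can be compared on the query string from the question with a small Python sketch (Python is an assumption; any engine with lookahead should behave the same way here):

```python
import re

query = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

consuming = re.findall(r"(?:\S)\++(?:\S)", query)  # consumes the char after +
lookahead = re.findall(r"\S\++(?=\S)", query)      # only peeks at the char after +
plus_only = re.findall(r"\++", query)              # just the pluses

print(consuming)  # ['s+n', 'e+c', 's+e']
print(lookahead)  # ['s+', 'e+', 's+', 'e+']
print(plus_only)  # ['+', '+', '+', '+']
```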

You are correct: The fourth match doesn't happen because the surrounding character has already participated in the previous match. The solution is to use lookaround (if your regex implementation supports it; older JavaScript engines, for example, don't support lookbehind).
Try
(?<!\s)\++(?!\s)
This matches one or more + unless they are surrounded by whitespace. This also works if the plus is at the start or the end of the string.
Explanation:
(?<!\s) # assert that there is no space before the current position
# (but don't make that character a part of the match itself)
\++ # match one or more pluses
(?!\s) # assert that there is no space after the current position
If your regex implementation doesn't support lookbehind, you could also use
\S\++(?!\s)
That way, your match would contain the character before the plus, but not after it, and therefore there will be no overlapping matches (Thanks Gumbo!). This will fail to match a plus at the start of the string, though (because the \S does need to match a character). But this is probably not a problem.
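Both variants from this answer, sketched in Python (an assumed language; its re module supports lookbehind). The string "+abc" is a made-up example for the start-of-string case.

```python
import re

query = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

NO_SPACE_AROUND = re.compile(r"(?<!\s)\++(?!\s)")  # lookbehind + lookahead
FALLBACK = re.compile(r"\S\++(?!\s)")              # no lookbehind needed

print(NO_SPACE_AROUND.findall(query))   # ['+', '+', '+', '+']
print(FALLBACK.findall(query))          # ['s+', 'e+', 's+', 'e+']

# A plus at the very start of the string:
print(NO_SPACE_AROUND.findall("+abc"))  # ['+']
print(FALLBACK.findall("+abc"))         # []: \S has nothing to match before +
```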

You can use the regex:
(?<=\S)\++(?=\S)
To match only the +'s that are surrounded by non-whitespace.
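One difference from the negative-lookaround version shows up at the edges of the string. A Python sketch, using the made-up string "+a+b+" (both the language and the sample are assumptions):

```python
import re

text = "+a+b+"

# Positive lookarounds require an actual non-whitespace neighbour on each side.
positive = [m.start() for m in re.finditer(r"(?<=\S)\++(?=\S)", text)]

# Negative lookarounds only forbid whitespace, so string edges are accepted.
negative = [m.start() for m in re.finditer(r"(?<!\s)\++(?!\s)", text)]

print(positive)  # [2]
print(negative)  # [0, 2, 4]
```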
