RegEx lookahead but not immediately following - regex

I am trying to match terms such as the Dutch ge-berg-te. berg is a noun by itself, and ge...te is a circumfix, i.e. geberg does not exist, nor does bergte. gebergte does. What I want is a RegEx that matches berg or gebergte, working with a lookaround. I was thinking this would work
\b(?i)(ge(?=te))?berg(te)?\b
But it doesn't. I am guessing that is because a lookahead only checks the characters immediately following it, and not across other characters. Is there any way to match characters with a lookahead without the constraint that those characters have to come immediately after the others?
Valid matches would be:
Berg
berg
Gebergte
gebergte
Invalid matches could be:
Geberg
geberg
Bergte
bergte
ge-/Ge- and -te always have to occur together. Note that I want to try this with a lookahead. I know it can be done more simply, but I want to see if it's methodologically possible to do something like this.

Here is one non-lookaround based regex:
\b(berg|gebergte)\b
Use it with i (ignore case) flag. This regex uses alternation and word boundary to search for complete words berg OR gebergte.
Lookaround based regex:
(?<=\bge)berg(?=te\b)|\bberg\b
This regex uses a lookbehind and a lookahead to search for berg preceded by ge and followed by te. Alternatively, it matches the complete word berg using the word boundary assertion \b, which is zero-width just like the anchors ^ and $.
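As a quick check of the lookaround pattern above against the valid and invalid words from the question, here is a minimal sketch using Python's re module. The helper name is_valid is only illustrative, and the re.IGNORECASE flag stands in for the i flag mentioned earlier.

```python
import re

# Lookaround-based pattern from the answer, compiled case-insensitively.
BERG = re.compile(r"(?<=\bge)berg(?=te\b)|\bberg\b", re.IGNORECASE)

def is_valid(word):
    """True if the word is berg alone, or berg wrapped in ge...te."""
    return BERG.search(word) is not None

words = ["Berg", "berg", "Gebergte", "gebergte",
         "Geberg", "geberg", "Bergte", "bergte"]
for word in words:
    print(word, is_valid(word))
```

Because lookarounds do not consume characters, the matched text is only berg even for gebergte; the ge and te are asserted, not captured.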

To forbid a substring in general, put a negative lookahead at the beginning of the pattern and let it scan past any number of other characters before the substring you want to forbid:
regex: don't match if containing a specific string
^(?!.*720).*
This fails if the string contains 720 anywhere, and otherwise matches the whole string.
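A small sketch of that pattern in Python, assuming single-line input (the dot does not cross newlines by default). The function name allowed is made up for the example.

```python
import re

# Negative lookahead at the start: fail if "720" occurs anywhere later.
NO_720 = re.compile(r"^(?!.*720).*")

def allowed(line):
    """True if the line does not contain the substring 720."""
    return NO_720.match(line) is not None

print(allowed("resolution 1080"))  # no 720 anywhere in the line
print(allowed("resolution 720p"))  # 720 present, so the lookahead rejects it
```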

Related

How to write regex that excludes the name after Mr. when finding words at the start of a sentence?

I currently have to make a regex that matches the first words at the start of sentences. I've currently done to the point where it matches the first word at the start of the paragraph and the rest, the first words that come after ". The problem that I have here is that 'Sherwood' which is obviously a name, shouldn't be matched but is because it matches the regex which I have written. 'Capital starting letter, comes directly after . and a space'
How can I change my code to exclude the name that comes after Mr. or Dr.?
Current regex: ((^[A-Z]+[a-z]*[A-Z]*[a-z]*|(?<=\")[A-Z]+[a-z]*[A-Z]*[a-z]*)|(?<=\.\s)[A-Z]+[a-z]*[A-Z]*[a-z]*)
I've used regex101.com as a reference.
You could shorten the pattern with the alternations to a single non capture group containing the 3 patterns that are allowed for the start of the string.
As you are already using a lookbehind assertion, you can exclude Mr. or Dr. to the left using a negative lookbehind:
(?:^|(?<=")|(?<=\.\s))(?<![MD]r\. )[A-Z]+[a-z]*[A-Z]*[a-z]*
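To see the negative lookbehind at work, here is a hedged sketch in Python. The sample sentence is invented for illustration, and Python's re is assumed to treat this fixed-width lookbehind the same way as the regex101 flavor the question used.

```python
import re

# Sentence-start pattern with the Mr./Dr. exclusion from the answer above.
FIRST_WORD = re.compile(
    r'(?:^|(?<=")|(?<=\.\s))(?<![MD]r\. )[A-Z]+[a-z]*[A-Z]*[a-z]*'
)

# Hypothetical sample text: "Sherwood" follows "Mr. " and must be skipped.
text = 'Yesterday Mr. Sherwood left. He was tired. "Nobody saw him."'

first_words = FIRST_WORD.findall(text)
print(first_words)
```

The name after Mr. is preceded by a period and a space like any sentence start, so only the extra four-character lookbehind keeps it out of the result.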
You might also match an uppercase char first and only then run the assertions, so that the alternation of lookbehind assertions is not tried at every position where there is no match.
[A-Z](?<=".|\.\s.|^.)(?<![MD]r\. .)(?:[A-Z]*[a-z]*){2}

Regexp. How to match word isn't followed and preceded by another characters

I want to replace mm units to cm units in my code. In the case of the big amount of such replacements I use regexp.
I made such expression:
(?!a-zA-Z)mm(?!a-zA-Z)
But it still matches words like summa, gamma and dummy.
How to make up regexp correctly?
Use character classes and change the first (?!...) lookahead into a lookbehind:
(?<![a-zA-Z])mm(?![a-zA-Z])
^^^^^^^^^^^^^  ^^^^^^^^^^^^
The pattern matches:
(?<![a-zA-Z]) - a negative lookbehind that fails the match if there is an ASCII letter immediately to the left of the current location
mm - a literal substring
(?![a-zA-Z]) - a negative lookahead that fails the match if there is an ASCII letter immediately to the right of the current location
NOTE: If you need to make your pattern Unicode-aware, replace [a-zA-Z] with [^\W\d_] (and use re.U flag if you are using Python 2.x).
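A minimal sketch of the replacement in Python, using the ASCII-only pattern from this answer. The sample string and the helper name mm_to_cm are assumptions made for the example.

```python
import re

# "mm" not preceded and not followed by an ASCII letter.
MM_UNIT = re.compile(r"(?<![a-zA-Z])mm(?![a-zA-Z])")

def mm_to_cm(text):
    """Replace standalone mm units with cm, leaving words like summa alone."""
    return MM_UNIT.sub("cm", text)

print(mm_to_cm("width: 10mm; summa, gamma, dummy; gap 5 mm"))
```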
There's no need to use lookaheads and lookbehinds, so if you wish to simplify your pattern you can try something like this:
\d+\s?(mm)\b
This does assume that your millimetre symbol will always follow a number, with an optional space in between, which I think is a reasonable assumption in this case.
The \b checks for a word boundary to make sure the mm is not part of a word such as dummy etc.
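For the replacement itself, one possible adaptation in Python is to capture the number and optional space instead of the unit, so they can be kept via a backreference. This regrouping is an assumption of the sketch, not part of the answer's pattern, and the helper name is illustrative.

```python
import re

# Variation on the pattern above: group 1 holds the digits and optional space,
# so the replacement can keep them and swap only the unit.
NUMBER_MM = re.compile(r"(\d+\s?)mm\b")

def number_mm_to_cm(text):
    """Replace mm with cm only where it directly follows a number."""
    return NUMBER_MM.sub(r"\1cm", text)

print(number_mm_to_cm("10mm and 5 mm, but not dummy or summa"))
```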

Regex Negative Lookbehind Matches Lookbehind text .NET

Say I have the following strings:
PB-GD2185-11652-MTCH
GD2185-11652-MTCH
KD-GD2185-11652-MTCH
KD-GD2185-11652
I want REGEX.IsMatch to return true if the string has MTCH in it and does not start with PB.
I expected the regex to be the following:
^(?<!PB)\S+(?=MTCH)
but that gives me the following matches:
PB-GD2185-11652-
GD2185-11652-
KD-GD2185-11652-
I do not understand why the negative lookbehind not only doesn't exclude the match but includes the PB characters in the match. The positive lookahead works as expected.
EDIT 1
Let me start with a simpler example. The following regex matches all of the strings as I would expect it to:
\S+
The following regex still matches all of the strings even though I would expect it not to:
\S+(?!MTCH)
The following regex matches all but the final H character on the first three strings:
\S+(?<!MTCH)
From the documentation at regex101, a lookahead looks for text to the right of the pattern and a lookbehind looks for text to the left of the pattern, so having a lookahead at the beginning of a string does not jibe with the documentation.
Edit 2
take another example with the following three strings:
grey
greyhound
hound
the regex:
^(?<!grey)hound
only matches the final hound, whereas the regex:
^(?<!grey)\S+
matches all three.
You need a lookahead: ^(?!PB)\S+(?=MTCH). Using the look-behind means the PB has to come before the first character.
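The difference can be reproduced with the four sample strings. This sketch uses Python's re rather than .NET, on the assumption that both engines treat these particular patterns the same way. At the very start of the string nothing precedes the current position, so (?<!PB) always succeeds there, while (?!PB) inspects the characters that follow.

```python
import re

samples = [
    "PB-GD2185-11652-MTCH",
    "GD2185-11652-MTCH",
    "KD-GD2185-11652-MTCH",
    "KD-GD2185-11652",
]

lookahead = re.compile(r"^(?!PB)\S+(?=MTCH)")    # suggested fix
lookbehind = re.compile(r"^(?<!PB)\S+(?=MTCH)")  # pattern from the question

def matched(pattern, s):
    """Return the matched text, or None if the pattern does not match."""
    m = pattern.search(s)
    return m.group(0) if m else None

for s in samples:
    print(s, matched(lookahead, s), matched(lookbehind, s))
```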
The problem was caused by the greediness of \S+. When dealing with lookarounds and greedy quantifiers you can easily match more characters than you expect. One way to deal with this is to insert a negative lookaround in a group with the greedy quantifier to exclude it as a match, as stated in this question:
How to non-greedy multiple lookbehind matches
and on this helpful website about greediness in regular expressions:
http://www.rexegg.com/regex-quantifiers.html
Note that this second link has a few other ways to deal with the greediness in various situations.
A good regular expression for this situation is as follows:
^(?<!PB)((?!PB)\S+)(MTCH)
In situations like this it is going to be much clearer to do the check logically in code: first check that the string contains MTCH, and then that it doesn't match ^PB.
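The two-step check could look like this in Python. The original question is about .NET, so this is only a sketch of the logic, with an illustrative function name.

```python
def is_wanted(s):
    """True if s contains MTCH and does not start with PB."""
    return "MTCH" in s and not s.startswith("PB")

for s in ["PB-GD2185-11652-MTCH", "GD2185-11652-MTCH",
          "KD-GD2185-11652-MTCH", "KD-GD2185-11652"]:
    print(s, is_wanted(s))
```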

How to match all strings other than a particular one

To match all characters except vowels, we can use [^aeiou].
I wonder
how to match all strings other than a particular one? For example, I want to match a string which is not dog. So cat, sky, and mike will all be matches.
how to match all strings other than a few strings, or other than a regular expression?
For example, I want to match a string which is not c.t. So sky and mike will all be matches, but cat and cut will not be matches.
Thanks.
1. How to match all strings other than a particular one
^(?!your_string$).*$
2. How to match all strings other than a few strings
^(?!(?:string1|string2|string3)$).*$
How does that work?
The idea is to use a negative lookahead (?! to check that the string does not consist solely of the string(s) to avoid. If the negative lookahead (which is an assertion) succeeds, the .*$ matches everything to the end of the string.
Note the use of the ^ anchor at the beginning to ensure we are positioned at the beginning of the string.
Note the $ anchor inside the negative lookahead to ensure that we are excluding your_string if it is indeed the whole string, but that we do not exclude your_string and more.
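Both cases from the question (not dog, and not c.t) can be tried with a small Python sketch. The helper not_exactly is a hypothetical convenience that drops a forbidden sub-pattern into the template above; it assumes the sub-pattern is itself a valid regex.

```python
import re

def not_exactly(forbidden_pattern):
    """Compile a regex matching any line that is not entirely forbidden_pattern."""
    return re.compile(rf"^(?!(?:{forbidden_pattern})$).*$")

not_dog = not_exactly("dog")
not_cxt = not_exactly("c.t")

for word in ["dog", "dogs", "cat", "cut", "sky", "mike"]:
    print(word, bool(not_dog.match(word)), bool(not_cxt.match(word)))
```

Note that dogs still matches not_dog: the $ inside the lookahead only rejects the string when dog is the whole string.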
Reference
Mastering Lookahead and Lookbehind
Negative Lookaheads

How to negate the whole regex?

I have a regex, for example (ma|(t){1}). It matches ma and t and doesn't match bla.
I want to negate the regex, thus it must match bla and not ma and t, by adding something to this regex. I know I can write bla, the actual regex is however more complex.
Use negative lookaround: (?!pattern)
Positive lookarounds can be used to assert that a pattern matches. Negative lookarounds are the opposite: they are used to assert that a pattern DOES NOT match. Some flavors support assertions; some put limitations on lookbehind, etc.
Links to regular-expressions.info
Lookahead and Lookbehind Zero-Width Assertions
Flavor comparison
See also
How do I convert CamelCase into human-readable names in Java?
Regex for all strings not containing a string?
A regex to match a substring that isn’t followed by a certain other substring.
More examples
These are attempts to come up with regex solutions to toy problems as exercises; they should be educational if you're trying to learn the various ways you can use lookarounds (nesting them, using them to capture, etc):
codingBat plusOut using regex
codingBat repeatEnd using regex
codingbat wordEnds using regex
Assuming you only want to disallow strings that match the regex completely (i.e., mmbla is okay, but mm isn't), this is what you want:
^(?!(?:m{2}|t)$).*$
(?!(?:m{2}|t)$) is a negative lookahead; it says "starting from the current position, the next few characters are not mm or t, followed by the end of the string." The start anchor (^) at the beginning ensures that the lookahead is applied at the beginning of the string. If that succeeds, the .* goes ahead and consumes the string.
FYI, if you're using Java's matches() method, you don't really need the ^ and the final $, but they don't do any harm. The $ inside the lookahead is required, though.
\b(?=\w)(?!(ma|(t){1}))\b(\w*)
This is for the given regex.
The \b finds a word boundary.
The positive lookahead (?=\w) is there to avoid spaces.
The negative lookahead over the original regex prevents matches of it.
Finally, the (\w*) catches all the words that are left.
The group that will hold the words is group 3.
The simple (?!pattern) will not work, as any substring will match.
The simple ^(?!(?:m{2}|t)$).*$ will not work, as its granularity is full lines.
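A short Python sketch of this word-level pattern, collecting group 3 for each match. The sample text and the helper name remaining_words are invented for the example. One caveat it makes visible: the negative lookahead only inspects the start of each word, so a word that merely begins with ma or t (such as table) is skipped as well.

```python
import re

# Word-level negation from the answer; group 3 holds the surviving word.
WORDS = re.compile(r"\b(?=\w)(?!(ma|(t){1}))\b(\w*)")

def remaining_words(text):
    """Return the words that do not start with ma or t."""
    return [m.group(3) for m in WORDS.finditer(text)]

print(remaining_words("ma t bla table"))
```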
This regexp matches your condition:
^.*(?<!ma|t)$
Look at how it works:
https://regex101.com/r/Ryg2FX/1
Apply this if you use Laravel.
Laravel has a not_regex rule, under which the field under validation must not match the given regular expression; it uses the PHP preg_match function internally.
'email' => 'not_regex:/^.+$/i'