Having trouble deciphering a sentence tokenizer regex

Having trouble deciphering a sentence tokenizer regex - regex

The following regex is suppose to act as a sentence tokenizer pattern, but I'm having some trouble deciphering what exactly it's doing:
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s
I understand that it's using positive and negative lookbehinds, as the accepted answer of this post explains (they give the example of a negative lookbehind like this: (?<!B)A). But what is considered A in the above regex?

The regex is checking for breaks between sentences. The negative lookbehinds prevent false matches that represent abbreviations instead of the ends of sentences. They mean:
(?<!\w\.\w.) Don't match anything that looks like A.b., 2.c., or 1.3. (Probably they meant for the second period to also be \. to match only a period, but as written it will match any character at the end, for example A.b! or g.Z4)
(?<![A-Z][a-z]\.) Don't match anything that looks like Cf., Dr., Mr., etc. Note this only checks two characters, so "Mrs." will be matched incorrectly.
(?<![A-Z]\.) Don't match anything that looks like A. or C.
Then if these all pass, it has a positive lookbehind (?<=\.|\?|\!) to check for ., ? or !.
And finally it matches on any whitespace \s.
Demo

Related

Regex Negative Lookbehind Matches Lookbehind text .NET

Say I have the following strings:
PB-GD2185-11652-MTCH
GD2185-11652-MTCH
KD-GD2185-11652-MTCH
KD-GD2185-11652
I want REGEX.IsMatch to return true if the string has MTCH in it and does not start with PB.
I expected the regex to be the following:
^(?<!PB)\S+(?=MTCH)
but that gives me the following matches:
PB-GD2185-11652-
GD2185-11652-
KD-GD2185-11652-
I do not understand why the negative lookbehind not only doesn't exclude the match but includes the PB characters in the match. The positive lookahead works as expected.
EDIT 1
Let me start with a simpler example. The following regex matches all of the strings as I would expect it to:
\S+
The following regex still matches all of the strings even though I would expect it not to:
\S+(?!MTCH)
The following regex matches all but the final H character on the first three strings:
\S+(?<!MTCH)
From the documentation at regex 101, a lookahead looks for text to the right of the pattern and a lookbehind looks for text to the left of the pattern, so having a lookahead at the beginning of a string does not jive with the documentation.
Edit 2
take another example with the following three strings:
grey
greyhound
hound
the regex:
^(?<!grey)hound
only matches the final hound. whereas the regex:
^(?<!grey)\S+
matches all three.

You need a lookahead: ^(?!PB)\S+(?=MTCH). Using the look-behind means the PB has to come before the first character.

The problem was because of the greediness of \S+. When dealing with lookarounds and greedy quantifiers you can easily match more characters than you expect. One way to deal with this is to insert a negative lookaround in a group with the greedy quantifier to exclude it as a match as stated in this question:
How to non-greedy multiple lookbehind matches
and on this helpful website about greediness in regular expressions:
http://www.rexegg.com/regex-quantifiers.html
Note that this second link has a few other ways to deal with the greediness in various situations.
A good regular expression for this situation is as follows:
^(?<!PB)((?!PB)\S+)(MTCH)

In situations like this it is going to be much clearer to do it logically within the code. So first check if the string matches MTCH and then that it doesn't match ^PB

RegEx lookahead but not immediately following

I am trying to match terms such as the Dutch ge-berg-te. berg is a noun by itself, and ge...te is a circumfix, i.e. geberg does not exist, nor does bergte. gebergte does. What I want is a RegEx that matches berg or gebergte, working with a lookaround. I was thinking this would work
\b(?i)(ge(?=te))?berg(te)?\b
But it doesn't. I am guessing because a lookahead only checks the immediate following characters, and not across characters. Is there any way to match characters with a lookahead withouth the constraint that those characters have to be immediately behind the others?
Valid matches would be:
Berg
berg
Gebergte
gebergte
Invalid matches could be:
Geberg
geberg
Bergte
bergte
ge-/Ge- and -te always have to occur together. Note that I want to try this with a lookahead. I know it can be done simpler, but I want to see if its methodologically possible to do something like this.

Here is one non-lookaround based regex:
\b(berg|gebergte)\b
Use it with i (ignore case) flag. This regex uses alternation and word boundary to search for complete words berg OR gebergte.
RegEx Demo
Lookaround based regex:
(?<=\bge)berg(?=te\b)|\bberg\b
This regex used a lookahead and lookbehind to search for berg preceded by ge and followed by te. Alternatively it matches complete word berg using word boundary asserter \b which is also 0-width asserter like anchors ^ and $.

To generally forbid a sign, you can put the negative lookaround to the beginning of a string and combine it with random number of other signs before the string you want to forbid:
regex: don't match if containing a specific string
^(?!.\*720).*
This will not match, if the string contains 720, but else match everything else.

Regex to match certain word but not a particular combination

I have 15 titles as follows:
fruits-and-flowers-themeA
fruits-and-flowers-themeB
fruits-and-flowers-just-test-themeA
themeAfruitsandflowers
nice-fruits-and-flowers-themeA
botanical-names-themeA
I want a regex to help me get only those titles with "themeA" in them, but it should not include "nice" and not include "just-test" or "just-tests".
I tried
^(?!.*just-test|*just-tests|nice).*?(?:themeA).*,
but I still get fruits-and-flowers-just-test-themeA in the output.
How to fix this?
Thanks

You can use this regex with negative lookahead:
^(?!.*?(?:just-tests?|nice)).*?themeA.*$
Working Demo

Option 1
You can use a single regex with lookaheads (see online demo):
^(?!.*nice?)(?!.*just-tests?).*themeA.*
The ^ asserts that the match starts at the beginning of the string (so we don't match a subset of the string
The (?!.*nice?) is a negative lookahead that asserts that at this position in the string, we cannot find any characters followed by nice
The (?!.*just-tests?) is a negative lookahead that asserts that at this position in the string, we cannot find any characters followed by just-test and an optional s
As a further tweak, you can compress the lookaheads into one using an | alternation as in anubhava's answer.
Option 2 without lookaheads (Perl, PHP/PCRE)
^(?:.*(?:nice|just-tests?).*)(*SKIP)(?!)|.*themeA.*
This one doesn't use lookaheads but just skips the unwanted titles. See demo.

Use two different regular expressions for clarity and simplicity.
Match your string against one regex that matches themeA:
/themeA/
and then check that the string does NOT match the one you don't want:
/nice|just-tests?/
Doing it in two different regexes makes it far easier to understand and maintain.

Regular Expression: match only non-repeated occurrence of a character

I need to find and replace all occurrences of apostrophe character in a string, but only if this apostrophe is not followed by another apostrophe.
That is
abc'def
is a match but
abc''def
is NOT a match.
I've already composed a working pattern - (^|[^'])'($|[^']) but I believe it may be shorter and simpler.
Thanks,
Valery

depends on your environment - if your environment supports lookahead and lookbehind, you can do this: (?<!')'(?!')
Ref: http://www.regular-expressions.info/lookaround.html

I think your pattern is short and precise. You could be using negative lookahead/lookbehind, but they would make it a lot more complex. Maintainability is important.

You'll have to be careful for an uneven number of apostrophes:
abc'''def
where you probably do want to replace the 3rd one and leave the 1st and 2nd in there.
You can do that like this (assuming you already matched string literals and only want to replace the uneven numbered trailing apostrophe):
Search for the pattern:
(('')*)'
and replace it with
$1
which is group 1: the even numbered apostrophes (or no apostrophes at all).
I'm not sure what actual problem you're solving, but in case you're parsing/reading a CSV file, or a string that has the likes of CSV input, I highly recommend using a decent CSV parser. Almost all languages have them in some form or another.

see here nagative lookahed q(?!u)
(?=pattern) is a positive look-ahead assertion
(?!pattern) is a negative look-ahead assertion
(?<=pattern) is a positive look-behind assertion
(?<!pattern) is a negative look-behind assertion
http://www.regular-expressions.info/lookaround.html
working DEMO

How to negate the whole regex?

I have a regex, for example (ma|(t){1}). It matches ma and t and doesn't match bla.
I want to negate the regex, thus it must match bla and not ma and t, by adding something to this regex. I know I can write bla, the actual regex is however more complex.

Use negative lookaround: (?!pattern)
Positive lookarounds can be used to assert that a pattern matches. Negative lookarounds is the opposite: it's used to assert that a pattern DOES NOT match. Some flavor supports assertions; some puts limitations on lookbehind, etc.
Links to regular-expressions.info
Lookahead and Lookbehind Zero-Width Assertions
Flavor comparison
See also
How do I convert CamelCase into human-readable names in Java?
Regex for all strings not containing a string?
A regex to match a substring that isn’t followed by a certain other substring.
More examples
These are attempts to come up with regex solutions to toy problems as exercises; they should be educational if you're trying to learn the various ways you can use lookarounds (nesting them, using them to capture, etc):
codingBat plusOut using regex
codingBat repeatEnd using regex
codingbat wordEnds using regex

Assuming you only want to disallow strings that match the regex completely (i.e., mmbla is okay, but mm isn't), this is what you want:
^(?!(?:m{2}|t)$).*$
(?!(?:m{2}|t)$) is a negative lookahead; it says "starting from the current position, the next few characters are not mm or t, followed by the end of the string." The start anchor (^) at the beginning ensures that the lookahead is applied at the beginning of the string. If that succeeds, the .* goes ahead and consumes the string.
FYI, if you're using Java's matches() method, you don't really need the the ^ and the final $, but they don't do any harm. The $ inside the lookahead is required, though.

\b(?=\w)(?!(ma|(t){1}))\b(\w*)
this is for the given regex.
the \b is to find word boundary.
the positive look ahead (?=\w) is here to avoid spaces.
the negative look ahead over the original regex is to prevent matches of it.
and finally the (\w*) is to catch all the words that are left.
the group that will hold the words is group 3.
the simple (?!pattern) will not work as any sub-string will match
the simple ^(?!(?:m{2}|t)$).*$ will not work as it's granularity is full lines

This regexp math your condition:
^.*(?<!ma|t)$
Look at how it works:
https://regex101.com/r/Ryg2FX/1

Apply this if you use laravel.
Laravel has a not_regex where field under validation must not match the given regular expression; uses the PHP preg_match function internally.
'email' => 'not_regex:/^.+$/i'

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Having trouble deciphering a sentence tokenizer regex - regex

Related

Regex Negative Lookbehind Matches Lookbehind text .NET

RegEx lookahead but not immediately following

Regex to match certain word but not a particular combination

Regular Expression: match only non-repeated occurrence of a character

How to negate the whole regex?

Categories

Resources