regex match substring unless another substring matches


I'm trying to dig deeper into regexes and want to match a condition unless some substring is also found in the same string. I know I can use two grepl statements (as seen below) but am wanting to use a single regex to test for this condition as I'm pushing my understanding. Let's say I want to match the words "dog" and "man" using "(dog.*man|man.*dog)" (taken from here) but not if the string contains the substring "park". I figured I could use (*SKIP)(*FAIL) to negate the "park" but this does not cause the string to fail (shown below).
How can I match the logic of find "dog" & "man" but not "park" with 1 regex?
What is wrong with my understanding of (*SKIP)(*FAIL)?
The code:
x <- c(
"The dog and the man play in the park.",
"The man plays with the dog.",
"That is the man's hat.",
"Man I love that dog!",
"I'm dog tired",
"The dog park is no place for man.",
"Park next to this dog's man."
)
# Could do this but want one regex
grepl("(dog.*man|man.*dog)", x, ignore.case=TRUE) & !grepl("park", x, ignore.case=TRUE)
# Thought this would work, it does not
grepl("park(*SKIP)(*FAIL)|(dog.*man|man.*dog)", x, ignore.case=TRUE, perl=TRUE)

You can use the anchored look-ahead solution (requiring Perl-style regexp):
grepl("^(?!.*park)(?=.*dog.*man|.*man.*dog)", x, ignore.case=TRUE, perl=TRUE)
Here is an IDEONE demo
^ - anchors the pattern at the start of the string
(?!.*park) - fail the match if park is present
(?=.*dog.*man|.*man.*dog) - fail the match unless both dog and man are present (in either order).
Another version (more scalable) with 3 look-aheads:
^(?!.*park)(?=.*dog)(?=.*man)
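Lookaheads like these are not specific to R's perl=TRUE mode. As a rough illustration only, here is the three-lookahead pattern tried in JavaScript against the question's sample strings. The helper name matchesDogAndManWithoutPark is made up for this sketch, and the i flag stands in for ignore.case=TRUE.

```javascript
// Sketch: the three-lookahead pattern from the answer, tried in JavaScript.
// The "i" flag plays the role of ignore.case=TRUE in the R call.
const dogAndManNoPark = /^(?!.*park)(?=.*dog)(?=.*man)/i;

// Hypothetical helper name, used only for this illustration.
function matchesDogAndManWithoutPark(text) {
  return dogAndManNoPark.test(text);
}

// The sample vector from the question.
const x = [
  "The dog and the man play in the park.",
  "The man plays with the dog.",
  "That is the man's hat.",
  "Man I love that dog!",
  "I'm dog tired",
  "The dog park is no place for man.",
  "Park next to this dog's man."
];

console.log(x.map(matchesDogAndManWithoutPark));
// [ false, true, false, true, false, false, false ]
```

Each extra required word is one more (?=.*word) and each extra forbidden word is one more (?!.*word), which is why this form scales better than spelling out every ordering.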

stribizhev has already answered this question as it should be approached: with a negative lookahead.
I'll contribute to this particular question:
What is wrong with my understanding of (*SKIP)(*FAIL)?
(*SKIP) and (*FAIL) are regex control verbs.
(*FAIL) or (*F)
This is the easiest to understand. (*FAIL) is exactly the same as a negative lookahead with an empty subpattern: (?!). As soon as the regex engine gets to that verb in the pattern it forces an immediate backtrack.
(*SKIP)
When the regex engine first encounters this verb, nothing happens, because it only acts when it's reached on backtracking. But if there is a later failure, and it reaches (*SKIP) from right to left, the backtracking can't pass (*SKIP). It causes:
A match failure.
The next match won't be attempted from the next character. Instead, it will start from the position in the text where the engine was when it reached (*SKIP).
That is why these two control verbs are usually together as (*SKIP)(*FAIL)
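JavaScript has no backtracking control verbs, but the (?!) equivalent of (*FAIL) can be tried there. The sketch below only illustrates that half of the pair: a forced failure with nothing like (*SKIP) in front of it rejects the current path, and the engine simply carries on with the other alternative at later positions.

```javascript
// Sketch: (?!) is an empty negative lookahead, so it can never succeed.
const neverMatches = /(?!)/;
console.log(neverMatches.test("anything")); // false

// Forced failure after "park" with no (*SKIP): only that path is rejected,
// and the engine still finds "dog" further along in the subject.
const parkThenFail = /park(?!)|dog/;
const found = parkThenFail.exec("park dog");
console.log(found[0], found.index); // dog 5
```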
Let's consider the following example:
Pattern: .*park(*SKIP)(*FAIL)|.*dog
Subject: "That park has too many dogs"
Matches: " has too many dog"
Internals:
First attempt.
That park has too many dogs    ||    .*park(*SKIP)(*FAIL)|.*dog
        /\                                /\
      (here) we have a match for park
             the engine passes (*SKIP) -no action
             it then encounters (*FAIL) -backtrack
             Now it reaches (*SKIP) from the right -FAIL!
Second attempt.
Normally, it should start from the second character in the subject. However, (*SKIP) has this particular behaviour. The 2nd attempt starts:
That park has too many dogs    ||    .*park(*SKIP)(*FAIL)|.*dog
        /\                           /\
      (here)
      Now, there's no match for .*park
      And of course it matches .*dog
That park has too many dogs    ||    .*park(*SKIP)(*FAIL)|.*dog
         ^               ^                                -----
         |   (MATCH!)    |
         +---------------+
DEMO
How can I match the logic of find "dog" & "man" but not "park" with 1 regex?
Use stribizhev's solution!! Try to avoid using control verbs for the sake of compatibility, they're not implemented in all regex flavours. But if you're interested in these regex oddities, there's another stronger control verb: (*COMMIT). It is similar to (*SKIP), acting only while on backtracking, except it causes the entire match to fail (there won't be any other attempt at all). For example:
+-------------------------------------------------+
| Pattern:                                        |
| ^.*park(*COMMIT)(*FAIL)|dog                     |
+---------------------------------------+---------+
| Subject                               | Matches |
+---------------------------------------+---------+
| The dog and the man play in the park. | FALSE   |
| Man I love that dog!                  | TRUE    |
| I'm dog tired                         | TRUE    |
| The dog park is no place for man.     | FALSE   |
| park next to this dog's man.          | FALSE   |
+---------------------------------------+---------+
IDEONE demo

Related

Regex to match all the words looking for [duplicate]

I have found very similar posts, but I can't quite get my regular expression right here.
I am trying to write a regular expression which returns a string which is between two other strings. For example: I want to get the string which resides between the strings "cow" and "milk".
My cow always gives milk
would return
"always gives"
Here is the expression I have pieced together so far:
(?=cow).*(?=milk)
However, this returns the string "cow always gives".

A lookahead (that (?= part) does not consume any input. It is a zero-width assertion (as are boundary checks and lookbehinds).
You want a regular match here, to consume the cow portion. To capture the portion in between, you use a capturing group (just put the portion of the pattern you want to capture inside parentheses):
cow(.*)milk
No lookaheads are needed at all.

Regular expression to get a string between two strings in JavaScript
The most complete solution, which works in the vast majority of cases, is a capturing group with a lazy dot matching pattern. However, a dot . in a JavaScript regex does not match line break characters, so what works in 100% of cases is a [^] or [\s\S]/[\d\D]/[\w\W] construct.
ECMAScript 2018 and newer compatible solution
In JavaScript environments supporting ECMAScript 2018, s modifier allows . to match any char including line break chars, and the regex engine supports lookbehinds of variable length. So, you may use a regex like
var result = s.match(/(?<=cow\s+).*?(?=\s+milk)/gs); // Returns multiple matches if any
// Or
var result = s.match(/(?<=cow\s*).*?(?=\s*milk)/gs); // Same but whitespaces are optional
In both cases, the current position is checked for cow followed by whitespace (one or more characters in the first pattern, zero or more in the second), then any 0+ chars, as few as possible, are matched and consumed (i.e. added to the match value), and then milk is checked for (preceded by one or more, or zero or more, whitespace characters respectively).
Scenario 1: Single-line input
This and all other scenarios below are supported by all JavaScript environments. See usage examples at the bottom of the answer.
cow (.*?) milk
cow is found first, then a space, then any 0+ chars other than line break chars, as few as possible as *? is a lazy quantifier, are captured into Group 1 and then a space with milk must follow (and those are matched and consumed, too).
Scenario 2: Multiline input
cow ([\s\S]*?) milk
Here, cow and a space are matched first, then any 0+ chars as few as possible are matched and captured into Group 1, and then a space with milk are matched.
Scenario 3: Overlapping matches
If you have a string like >>>15 text1>>>67 text2>>> and you need to get 2 matches in between >>>+number+whitespace and >>>, you can't use />>>\d+\s(.*?)>>>/g, as this will only find 1 match: the >>> before 67 is already consumed when the first match is found. You may use a positive lookahead to check for the text's presence without actually "gobbling" it (i.e. appending it to the match):
/>>>\d+\s(.*?)(?=>>>)/g
See the online regex demo yielding text1 and text2 as Group 1 contents found.
Also see How to get all possible overlapping matches for a string.
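To make the difference concrete, here is a small sketch comparing the consuming pattern with the lookahead pattern. The input string is assumed to contain text1 and text2 as the two wanted values, and the variable names are illustrative only.

```javascript
// Assumed sample input with two wanted values, text1 and text2.
const input = ">>>15 text1>>>67 text2>>>";

// Consuming the closing >>> leaves nothing for the second match to start on.
const consuming = Array.from(input.matchAll(/>>>\d+\s(.*?)>>>/g), m => m[1]);

// A lookahead only checks for >>> without consuming it, so the same >>>
// can open the next match.
const lookahead = Array.from(input.matchAll(/>>>\d+\s(.*?)(?=>>>)/g), m => m[1]);

console.log(consuming); // [ 'text1' ]
console.log(lookahead); // [ 'text1', 'text2' ]
```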
Performance considerations
Lazy dot matching pattern (.*?) inside regex patterns may slow down script execution if very long input is given. In many cases, unroll-the-loop technique helps to a greater extent. Trying to grab all between cow and milk from "Their\ncow\ngives\nmore\nmilk", we see that we just need to match all lines that do not start with milk, thus, instead of cow\n([\s\S]*?)\nmilk we can use:
/cow\n(.*(?:\n(?!milk$).*)*)\nmilk/gm
See the regex demo (if there can be \r\n, use /cow\r?\n(.*(?:\r?\n(?!milk$).*)*)\r?\nmilk/gm). With this small test string, the performance gain is negligible, but with very large text, you will feel the difference (especially if the lines are long and line breaks are not very numerous).
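As a quick check that the unrolled pattern captures the same text as the lazy one, both can be run on the small test string from above. This sketch only compares results; it does not measure performance, and the variable names are made up for the illustration.

```javascript
// The small test string from the paragraph above.
const text = "Their\ncow\ngives\nmore\nmilk";

// Lazy version: expands one character at a time until "\nmilk" fits.
const lazy = /cow\n([\s\S]*?)\nmilk/;

// Unrolled version: takes whole lines, stopping before a line that is exactly "milk".
const unrolled = /cow\n(.*(?:\n(?!milk$).*)*)\nmilk/m;

const lazyResult = text.match(lazy)[1];
const unrolledResult = text.match(unrolled)[1];

console.log(JSON.stringify(unrolledResult)); // "gives\nmore"
console.log(lazyResult === unrolledResult);  // true
```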
Sample regex usage in JavaScript:
//Single/First match expected: use no global modifier and access match[1]
console.log("My cow always gives milk".match(/cow (.*?) milk/)[1]);
// Multiple matches: get multiple matches with a global modifier and
// trim the results if length of leading/trailing delimiters is known
var s = "My cow always gives milk, their cow also gives milk";
console.log(s.match(/cow (.*?) milk/g).map(function(x) {return x.slice(4, -5);}));
//or use RegExp#exec inside a loop to collect all the Group 1 contents
var result = [], m, rx = /cow (.*?) milk/g;
while ((m=rx.exec(s)) !== null) {
result.push(m[1]);
}
console.log(result);
Using the modern String#matchAll method
const s = "My cow always gives milk, their cow also gives milk";
const matches = s.matchAll(/cow (.*?) milk/g);
console.log(Array.from(matches, x => x[1]));

Here's a regex which will grab what's between cow and milk (without leading/trailing space):
srctext = "My cow always gives milk.";
var re = /(.*cow\s+)(.*)(\s+milk.*)/;
var newtext = srctext.replace(re, "$2");
An example: http://jsfiddle.net/entropo/tkP74/

You need to capture the .*
You can (but don't have to) make the .* nongreedy
There's really no need for the lookahead.
> /cow(.*?)milk/i.exec('My cow always gives milk');
["cow always gives milk", " always gives "]

The chosen answer didn't work for me...hmm...
Just add space after cow and/or before milk to trim spaces from " always gives "
/(?<=cow ).*(?= milk)/

I find regex to be tedious and time-consuming given the syntax. Since you are already using JavaScript, it is easier to do the following without regex:
const text = 'My cow always gives milk'
const start = `cow`;
const end = `milk`;
const middleText = text.split(start)[1].split(end)[0].trim() // trim() drops the spaces left around the result
console.log(middleText) // prints "always gives"

You can use the method match() to extract a substring between two strings. Try the following code:
var str = "My cow always gives milk";
var subStr = str.match("cow(.*)milk");
console.log(subStr[1]);
Output:
always gives
See a complete example here : How to find sub-string between two strings.

I was able to get what I needed using Martinho Fernandes' solution below. The code is:
var test = "My cow always gives milk";
var testRE = test.match("cow(.*)milk");
alert(testRE[1]);
You'll notice that I am alerting the testRE variable as an array. This is because testRE is returning as an array, for some reason. The output from:
My cow always gives milk
Changes into:
always gives

Just use the following regular expression:
(?<=My cow\s).*?(?=\smilk)

If the data is on multiple lines then you may have to use the following,
/My cow ([\s\S]*)milk/gm
My cow always gives
milk
Regex 101 example

You can use destructuring to only focus on the part of your interest.
So you can do:
let str = "My cow always gives milk";
let [, result] = str.match(/\bcow\s+(.*?)\s+milk\b/) || [];
console.log(result);
In this way you ignore the first part (the complete match) and only get the capture group's match. The addition of || [] may be interesting if you are not sure there will be a match at all. In that case match would return null, which cannot be destructured, so we fall back to [] instead, and then result will be undefined.
The additional \b ensures the surrounding words "cow" and "milk" are really separate words (e.g. not "milky"). Also \s+ is needed to avoid that the match includes some outer spacing.

The method match() searches a string for a match and returns an Array object.
// Original string
var str = "My cow always gives milk";
// Using index [0] would return
// "cow always gives milk"
str.match(/cow(.*)milk/)[0]
// Using index [1] would return
// " always gives " (including the surrounding spaces)
str.match(/cow(.*)milk/)[1]

Task
Extract the substring between two strings (excluding these two strings)
Solution
let allText = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum";
let textBefore = "five centuries,";
let textAfter = "electronic typesetting";
var regExp = new RegExp(`(?<=${textBefore}\\s)(.+?)(?=\\s+${textAfter})`, "g");
var results = regExp.exec(allText);
if (results && results.length > 1) {
console.log(results[0]);
}

regex doesn't match last word

I have this simple regex:
RegEx_Seek_1 := TDIPerlRegEx.Create{$IFNDEF DI_No_RegEx_Component}(nil){$ENDIF};
s1 := '(doesn''t|don''t|can''t|cannot|shouldn''t|wouldn''t|couldn''t|havn''t|hadn''t)';
// s1 contains this text: (doesn't|don't|can't|cannot|shouldn't|wouldn't|couldn't|havn't|hadn't)
RegEx_Seek_1.MatchPattern := '(*UCP)(?m)'+s1+' (a |the )(ear|law also|multitude|son)(?(?= of)( \* | \w+ )| )([^»Ô¶ ][^ »Ô¶]\w*)';
It is meant to find a noun with an article, which can be followed by "of". If there is an "of", then I need to search for a noun with \w+ (and \* too, as a substitute for the verb). The last word should be the verb.
The sample text:
. some text . Doesn't the ear try ...
. some text doesn't the law also say ...
. some text doesn't the son bear ...
. some text . Shouldn't the multitude of words be answered? ...
. some text . Why doesn't the son of * come to eat ...
My results:
Doesn't the ear try
doesn't the law also say
doesn't the son bear
Shouldn't the multitude of words
And it fails to get the last sentence:
doesn't the son of * come
My plan is to add \K before the last word to get the verb.
The exclusion of the characters:
[^»Ô¶] is excluded because », Ô and ¶ already represent marks in the text that describe an existing verb. They may or may not be present. I am using spaces. Tabs are delimiters and are not part of any sentence.
In this regex I included a space [^»Ô¶ ] to get the last word.
So the question is how to correct the regex to get one more line:
doesn't the son of * come
Edit:
I need to refer the verbs in the same group while replacing (I will refer to verb).

Your mistake is in (?(?= of)( \* | \w+ )| ).
Remember that lookaheads don't move the cursor forward, so the ( \* | \w+ ) will match of , so the remainder is now * come which can't be matched by ([^»Ô¶ ][^ »Ô¶]\w*) as the second character is a space.
I guess you should match the of already in your condition, like (?(?= of) of( \* | \w+ )| )

I modified Wiktor's pattern to match:
(*UCP)(?m)'+s1+' (a |the )(ear|law also|multitude|son)(?:\s+of Words|\s+of \*)*\s+\K(?P<verb>[^\s»Ô¶]+)
Now I can refer to the last group like this:
char(182)+'$<verb>'
I show my results how the verb was changed using Replace2 function of TDIRegEx. You see it works:
Why doesn't the son of * ¶come to eat
Doesn't the ear ¶try words,
Why doesn't the son ¶bear the
doesn't the law also ¶say the same thing?
Shouldn't the multitude of words ¶be answered?
Both answers, the one from Wiktor and the one from Sebastian, helped me to solve the question. Thank you.

Regex Checking if String Contains Two or More Instances of Words from a Set

I'm trying to write a regular expression to see if there are 2 or more of any words in a set in a given string.
If the set is [cat, dog] then:
"cat in the hat" - false
"cat and dog" - true
"cat and cat" - true
I tried these, but they don't work correctly:
\bcat\b|\bdog\b{2,}
(\bcat\b|\bdog\b){2,}
Is this query possible with regex?

Option 1: Pure Regex
(?:.*(?:\b(?:cat|dog)\b)){2}
If there is a match, it is True that two or more of the words are present.
If you want to be a purist about a regex that in itself constitutes a Boolean assertion (no matching of characters), we can wrap this in a lookahead:
^(?=(?:.*(?:\b(?:cat|dog)\b)){2})
Option 2: Count Matches
If you are using a programming language, this pseudocode:
WordsRegex = \b(?:cat|dog)\b
MatchCount = count matches(WordsRegex, string)
TwoOrMore = ( MatchCount > 1)
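The pseudocode above could be sketched in JavaScript roughly as follows. The function names countSetWords and hasTwoOrMore are made up for this example, and the sketch assumes the words in the set contain no regex metacharacters (they are joined into the pattern unescaped).

```javascript
// Count how many times any word from the set occurs as a whole word.
// Assumption: the words contain no regex metacharacters.
function countSetWords(text, words) {
  const wordsRegex = new RegExp("\\b(?:" + words.join("|") + ")\\b", "g");
  const matches = text.match(wordsRegex); // null when nothing matches
  return matches === null ? 0 : matches.length;
}

// True when two or more occurrences (of the same or different words) are found.
function hasTwoOrMore(text, words) {
  return countSetWords(text, words) > 1;
}

const set = ["cat", "dog"];
console.log(hasTwoOrMore("cat in the hat", set)); // false
console.log(hasTwoOrMore("cat and dog", set));    // true
console.log(hasTwoOrMore("cat and cat", set));    // true
```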

Do you want a simple true/false result for the match, or would you like to actually capture the words that match?
Some regex languages, like PCRE, allow for "pattern repeat" with the (?[some number]) format:
(?=(cat|dog).*(?1))
This looks for either cat or dog and then (due to the (?1)) looks for cat or dog again. Example 1.
If you wish to capture the pattern (either the entire thing, or the individual words), you can use one of:
((cat|dog).*((?2)))
Example 2
or
(?:(cat|dog).*((?1)))
Example 3
Example 2 captures the entire group in the \1 reference, with the captured words in \2 and \3, respectively.
Example 3 doesn't capture the entire group, but it does capture the words in \1 and \2, respectively.
Other languages (JavaScript, Python) may handle this differently, so you may not have access to the (?1) reference.

How to delete a pattern when it is not found between two symbols in Perl?

I have a document like this:
Once upon a time, there lived a cat.
The AAAAAA cat was ZZZZZZ very happy.
The AAAAAAcatZZZZZZ knew many other cats from many AAAAAA cities ZZZZZZ.
The cat knew brown cats and AAAAAA green catsZZZZZZ and red cats.
The AAAAAA and ZZZZZZ are similar to { and }, but are used to avoid problems with other scripts that might interpret { and } as other meanings.
I need to delete all appearances of "cat" when it is not found between an AAAAAA and ZZZZZZ.
Once upon a time, there lived a .
The AAAAAA cat was ZZZZZZ very happy.
The AAAAAAcatZZZZZZ knew many other s from many AAAAAA cities ZZZZZZ.
The knew brown s and AAAAAA green catsZZZZZZ and red s.
All AAAAAA's have a matching ZZZZZZ.
The AAAAAA's and matching ZZZZZZ's are not split across lines.
The AAAAAA's and matching ZZZZZZ's are never nested.
The pattern, "cat" in the example above, is not treated as a word. This could be anything.
I have tried several things, e.g.:
perl -pe 's/[^AAAAAAA](.*)(cat)(.*)[^BBBBBBB]//g' <<< "AAAAAAA cat 1 BBBBBBB cat 2"
How can I delete any pattern when it is not found between some matching set of symbols?

You have several possible ways:
You can use the \K feature to remove the part you don't want from match result:
s/AAAAAA.*?ZZZZZZ\K|cat//gs
(\K removes everything to its left from the match result, but those characters are still consumed by the regex engine. Consequently, when the first branch of the alternation succeeds, you replace the empty string immediately after ZZZZZZ with an empty string.)
You can use a capturing group to put the substring you want to preserve back as-is in the replacement string (with the reference $1):
s/(AAAAAA.*?ZZZZZZ)|cat/$1/gs
You can use backtracking control verbs to skip and not retry the substring matched:
s/AAAAAA.*?ZZZZZZ(*SKIP)(*FAIL)|cat//gs
((*SKIP) forces the regex engine to not retry the substring found on the left if the pattern fails later. (*FAIL) forces the pattern to fail.)
Note: if AAAAAA and ZZZZZZ must be always on the same line, you can remove the /s modifier and process the data line by line.
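For readers without access to \K or backtracking verbs, the capturing-group variant carries over to other engines. Below is a rough JavaScript sketch of the same idea, processing one line at a time; the function name deleteOutsideDelimiters is hypothetical.

```javascript
// Sketch of the capturing-group approach: a delimited block is matched first
// and put back unchanged; any "cat" matched on its own is deleted.
function deleteOutsideDelimiters(line) {
  return line.replace(/(AAAAAA.*?ZZZZZZ)|cat/g, (match, kept) =>
    // kept is undefined when the second alternative ("cat") matched.
    kept !== undefined ? kept : ""
  );
}

console.log(deleteOutsideDelimiters("Once upon a time, there lived a cat."));
// Once upon a time, there lived a .
console.log(deleteOutsideDelimiters(
  "The AAAAAAcatZZZZZZ knew many other cats from many AAAAAA cities ZZZZZZ."
));
// The AAAAAAcatZZZZZZ knew many other s from many AAAAAA cities ZZZZZZ.
```

The order of the alternatives matters: the delimited block has to be tried first so that a cat inside it is swallowed by the block rather than matched on its own.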

Regex: how to determine odd/even number of occurrences of a char preceding a given char?

I would like to replace the | with OR only in unquoted terms, eg:
"this | that" | "the | other" -> "this | that" OR "the | other"
Yes, I could split on space or quote, get an array and iterate through it, and reconstruct the string, but that seems ... inelegant. So perhaps there's a regex way to do this by counting "s preceding | and obviously odd means the | is quoted and even means unquoted. (Note: Processing doesn't start until there is an even number of " if there is at least one ").

It's true that regexes can't count, but they can be used to determine whether there's an odd or even number of something. The trick in this case is to examine the quotation marks after the pipe, not before it.
str = str.replace(/\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g, "OR");
Breaking that down, (?:[^"]*"){2} matches the next pair of quotes if there is one, along with the intervening non-quotes. After you've done that as many times as possible (which might be zero), [^"]*$ consumes any remaining non-quotes until the end of the string.
Of course, this assumes the text is well-formed. It doesn't address the problem of escaped quotes either, but it can if you need it to.
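A short usage sketch of that replacement, wrapped in a function whose name (replaceUnquotedPipes) is made up for this example. As noted above, it assumes balanced, unescaped double quotes.

```javascript
// Replace | with OR only when an even number of quotes follows it,
// i.e. when the pipe is outside any quoted term.
// Assumption: quotes are balanced and never escaped.
function replaceUnquotedPipes(str) {
  return str.replace(/\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g, "OR");
}

console.log(replaceUnquotedPipes('"this | that" | "the | other"'));
// "this | that" OR "the | other"
console.log(replaceUnquotedPipes('"this | that" | "the | other" | yet | another'));
// "this | that" OR "the | other" OR yet OR another
```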

Regexes do not count. That's what parsers are for.

You might find the Perl FAQ on this issue relevant.
#!/usr/bin/perl
use strict;
use warnings;
my $x = qq{"this | that" | "the | other"};
print join('" OR "', split /" \| "/, $x), "\n";

You don't need to count, because you don't nest quotes. This will do:
#!/usr/bin/perl
my $str = '" this \" | that" | "the | other" | "still | something | else"';
print "$str\n";
while($str =~ /^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/) {
$str =~ s/^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/$1OR/;
}
print "$str\n";
Now, let's explain that expression.
^          -- means you'll always match everything from the beginning of the string, otherwise
              the match might start inside a quote, and break everything
(...)\|    -- this means you'll match a certain pattern, followed by a |, which appears
              escaped here; so when you replace it with $1OR, you keep everything, but
              replace the |.
(?:...)*   -- This is a non-capturing group, which can be repeated multiple times; we
              use a group here so we can repeat the alternative patterns multiple times.
[^"|\\]*   -- This is the first pattern. Anything that isn't a pipe, an escape character
              or a quote.
\\.        -- This is the second pattern. Basically, an escape character and anything
              that follows it.
"(?:...)*" -- This is the third pattern. Open quote, followed by another
              non-capturing group repeated multiple times, followed by a closing
              quote.
[^\\"]     -- This is the first pattern in the second non-capturing group. It's anything
              except an escape character or a quote.
\\.        -- This is the second pattern in the second non-capturing group. It's an
              escape character and whatever follows it.
The result is as follows:
" this \" | that" | "the | other" | "still | something | else"
" this \" | that" OR "the | other" OR "still | something | else"

Another approach (similar to Alan M's working answer):
str = str.replace(/(".+?"|\w+)\s*\|\s*/g, '$1 OR ');
The part inside the first group (spaced for readability):
".+?" | \w+
... basically means, something quoted, or a word. The remainder means that it was followed by a "|" wrapped in optional whitespace. The replacement is that first part ("$1" means the first group) followed by " OR ".

Perhaps you're looking for something like this:
(?<=^([^"]*"[^"]*")+[^"|]*)\|

Thanks everyone. Apologies for neglecting to mention this is in javascript and that terms don't have to be quoted, and there can be any number of quoted/unquoted terms, eg:
"this | that" | "the | other" | yet | another -> "this | that" OR "the | other" OR yet OR another
Daniel, it seems that's in the ballpark, ie basically a matching/massaging loop. Thanks for the detailed explanation. In js, it looks like a split, a forEach loop on the array of terms, pushing a term (after changing a | term to OR) back into an array, and a re join.
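The split/loop/join idea described here might look roughly like the sketch below. It is one possible reading of that description, not the poster's actual code: the function name pipesToOr is hypothetical, and it assumes balanced quotes and whitespace around every unquoted pipe.

```javascript
// Tokenise into quoted terms and bare words, swap bare "|" tokens for "OR",
// then join the terms back together with single spaces.
// Assumptions: quotes are balanced; unquoted pipes are surrounded by whitespace.
function pipesToOr(query) {
  const terms = query.match(/"[^"]*"|\S+/g) || [];
  return terms.map(term => (term === "|" ? "OR" : term)).join(" ");
}

console.log(pipesToOr('"this | that" | "the | other" | yet | another'));
// "this | that" OR "the | other" OR yet OR another
```

Note that joining with a single space normalises any runs of whitespace between terms.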

@Alan M, works nicely, escaping not necessary due to the sparseness of sqlite FTS capabilities.
@epost, accepted solution for brevity and elegance, thanks. It merely needed to be put in a more general form for unicode etc.
(".+?"|[^\"\s]+)\s*\|\s*

My solution in C# to count the quotes and then regex to get the matches:
// Count the number of quotes.
var quotesOnly = Regex.Replace(searchText, @"[^""]", string.Empty);
var quoteCount = quotesOnly.Length;
if (quoteCount > 0)
{
// If the quote count is an odd number there's a missing quote.
// Assume a quote is missing from the end - executive decision.
if (quoteCount%2 == 1)
{
searchText += @"""";
}
// Get the matching groups of strings. Exclude the quotes themselves.
// e.g. The following line:
// "this and that" or then and "this or other"
// will result in the following groups:
// 1. "this and that"
// 2. "or"
// 3. "then"
// 4. "and"
// 5. "this or other"
var matches = Regex.Matches(searchText, @"([^\""]*)", RegexOptions.Singleline);
var list = new List<string>();
foreach (var match in matches.Cast<Match>())
{
var value = match.Groups[0].Value.Trim();
if (!string.IsNullOrEmpty(value))
{
list.Add(value);
}
}
// TODO: Do something with the list of strings.
}
