Regex select all BUT group - regex

So I'm in a situation where I must use only regex to select everything but a specific word. For the purposes of example, the word will be foobar. This is an example of what should happen:
this should be highlighted, and
same with this. but any sentence
that has the word
foobar
shouldnt be, and same for any regular
sentence with foobar <-- like that
foobar beginning a sentence should invalidate
the entire sentence, same with at the end foobar
only foobar, and nothing else of the sentence
more words here more irrelevant stuff to highlight
and nothing of the key word
what about multiple foobar on the same foobar line?
And what should be matched, would look something like this:
The best I could get is /\b(?!foobar)[^\n]+\n?/g, which works only if the word foobar sits alone on its own separate line, formatted like this:
not foobar
foobar (ignored)
totallynotfoobar
nobar
foobutts
foobar (ignored)
notagain
And the rest is matched... but this is not what I want.
So my question is, how would I accomplish the original example? Is it even possible?

Here's one way: (demo)
\W*\b(?!foobar).+?\b\W*
The ? in .+? is to ensure we stop matching as soon as we get a \b, otherwise we might skip over some foobar's.
The \W*'s are necessary to consume any leading or trailing non-word characters in the string.
Every word and every word separator are matched separately here, which might not be ideal.
Full explanation:
NODE        EXPLANATION
--------------------------------------------------------------------------------
\W*         non-word characters (all but a-z, A-Z, 0-9, _) (0 or more times (matching the most amount possible))
\b          the boundary between a word char (\w) and something that is not a word char
(?!         look ahead to see if there is not:
  foobar      'foobar'
)           end of look-ahead
.+?         any character except \n (1 or more times (matching the least amount possible))
\b          the boundary between a word char (\w) and something that is not a word char
\W*         non-word characters (all but a-z, A-Z, 0-9, _) (0 or more times (matching the most amount possible))
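To see how this pattern behaves, here is a small JavaScript sketch. The sample string and the variable names are made up for illustration and are not taken from the linked demo. It collects every match and shows that foobar itself never ends up in one:

```javascript
// Pattern from the answer above, with the global flag so every match is collected.
const skipWord = /\W*\b(?!foobar).+?\b\W*/g;

// Hypothetical sample input.
const sample = "a foobar b";

// Each word (plus its adjacent separators) is matched separately.
const pieces = sample.match(skipWord);
console.log(pieces);          // [ 'a ', ' b' ]
console.log(pieces.join("")); // prints "a  b" (foobar is gone, both spaces remain)
```

Because every word is a separate match, joining the pieces is one way to get back the text without the excluded word.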
A variation with look-behind and look-ahead: (with /gs or /gm) (demo)
(?<=^|\bfoobar\b)(?!foobar\b)(.*?)(?=\bfoobar\b|$)
I believe all those \b's are necessary to correctly handle the cases where foobar appears as part of a longer word (if foobar inside a longer word should also be excluded, removing all the \b's should work).
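A minimal JavaScript sketch of the look-behind variation, assuming an engine with look-behind support (for example a recent Node.js). The two-line sample text is hypothetical. With /gm, each match is a stretch of text between line starts, line ends, and standalone foobar words:

```javascript
// Look-behind/look-ahead variation from above, with the g and m flags.
const betweenWord = /(?<=^|\bfoobar\b)(?!foobar\b)(.*?)(?=\bfoobar\b|$)/gm;

// Hypothetical input: a standalone "foobar" on line 1,
// and "foobar" embedded in a longer word on line 2.
const text = "a foobar b\nnotfoobar here";

const chunks = text.match(betweenWord);
console.log(chunks); // [ 'a ', ' b', 'notfoobar here' ]
```

The second line is kept whole because the \b's stop foobar from being recognised inside notfoobar.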

Related

Ruby regex counting characters

I am trying to create a regex in ruby that matches against strings with 10 characters which are not special characters i.e. would match with \w.
So far I have come up with this:
/\w{10,}/
but the issue is that it will only count a consecutive sequence of word characters. I want to match any string which counts up to have at least 10 "word" characters. Is this possible? I am fairly new to regex as a whole so any help would be appreciated.
If I understood correctly, this should work:
/(?:\w[^\w]*){9,}\w/
Explanation:
We start with a single
\w
We want to capture all the other characters until another \w, hence:
\w[^\w]*
[^<list of chars>] matches any character other than listed in the brackets, so [^\w] means any character that is not a word character. * denotes 0 or more. The above will match "a-- ", "b" and "c!" in "a-- bc!" string.
Since we need 10 \w, we will match 9 (or more) groups like that, followed by a single \w
(\w[^\w]*){9,}\w
We don't really care about captures here (especially since Ruby keeps only the last repetition's capture for a repeated group anyway), so we make the group non-capturing:
(?:\w[^\w]*){9,}\w
Alternatively we could just use simpler regex:
(?:\w[^\w]*){10,}
But it will also cover characters after the last word character in a string - not sure if this is required here.
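As a quick check of the difference between the two approaches, here is a sketch in JavaScript rather than Ruby. That is an assumption of convenience: for ASCII input, \w, [^\w] and {9,} should behave the same in both engines. The test strings are made up:

```javascript
// Original attempt: needs 10 *consecutive* word characters.
const consecutiveOnly = /\w{10,}/;
// Pattern from the answer: 10 word characters with anything in between.
const atLeastTen = /(?:\w[^\w]*){9,}\w/;

const spaced = "a-b-c-d-e-f-g-h-i-j"; // 10 word characters, none adjacent

console.log(consecutiveOnly.test(spaced)); // false
console.log(atLeastTen.test(spaced));      // true
console.log(atLeastTen.test("a-b-c"));     // false (only 3 word characters)
```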
Match anywhere in the string:
/\w(?:\W*\w){9,19}/
/(?:\W*\w){10,20}/
Validate a string that contains 10 to 20 word characters:
/\A(?:\W*\w){10,20}\W*\z/
Prefer non-capturing groups, particularly when extracting found matches.
Watch out for ^ and $: in Ruby's regex they mark the start and end of a line, not of the whole string, so use \A and \z for validation.
EXPLANATION
--------------------------------------------------------------------------------
\A          the beginning of the string
(?:         group, but do not capture (between 10 and 20 times (matching the most amount possible)):
  \W*         non-word characters (all but a-z, A-Z, 0-9, _) (0 or more times (matching the most amount possible))
  \w          word characters (a-z, A-Z, 0-9, _)
){10,20}    end of grouping
\W*         non-word characters (all but a-z, A-Z, 0-9, _) (0 or more times (matching the most amount possible))
\z          the end of the string
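The validation pattern can be tried out with the following sketch. It is a JavaScript port, not the Ruby original: JavaScript has no \A or \z, so the sketch assumes ^ and $ without the m flag as the closest equivalent. The inputs are invented:

```javascript
// Port of /\A(?:\W*\w){10,20}\W*\z/ : without the m flag,
// ^ and $ anchor the whole string in JavaScript.
const tenToTwenty = /^(?:\W*\w){10,20}\W*$/;

console.log(tenToTwenty.test("abc-def-ghij!")); // true  (10 word characters)
console.log(tenToTwenty.test("abc"));           // false (too few)
console.log(tenToTwenty.test("a".repeat(21)));  // false (too many)
```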

How can I make this only match the words after the word 'speaks' and ignore commas and spaces

I have the following RegEX:
/(?<=\w\sspeaks\s)(?!,|\s|\.)([\w]+)/gmi
The string is:
Example Person speaks ExampleLanguage1, ExampleLanguage2, ExampleLanguage3 and ExampleLanguage4.
Example Person two speaks ExampleLanguage1, ExampleLanguage2, ExampleLanguage3 and ExampleLanguage4.
Example Person three speaks ExampleLanguage1 and ExampleLanguage2.
For me, the above only matches:
ExampleLanguage1
ExampleLanguage1
ExampleLanguage1
I want to match:
ExampleLanguage1
ExampleLanguage2
ExampleLanguage3
ExampleLanguage4
ExampleLanguage1
ExampleLanguage2
ExampleLanguage3
ExampleLanguage4
ExampleLanguage1
ExampleLanguage2
The words Example Person can be any word, even without space in-between.
And the words ExampleLanguage do not have numbers marked. And they also can have spaces, and can be any word.
Here is a link to it: https://regex101.com/r/MjL8cW/1
If you can make use of the \G anchor, you might match 4 or more word characters, or match words with 1-3 characters and use \K to clear the match buffer.
(?:^.*?\hspeaks|\w{1,3}|\G(?!^))[,.]?\h+\K\w{4,}
The pattern matches:
(?: Non capture group
^.*?\hspeaks Match From the start of the string till the first occurrence of a whitespace char and speaks.
| Or
\w{1,3} Match 1-3 word chars
| Or
\G(?!^) Assert the position at the end of the previous match, but not at the start
) Close non capture group
[,.]?\h+ Match an optional , or . and 1 or more horizontal whitespace chars
\K\w{4,} Forget what has been matched so far using \K, then match 4 or more word chars
Regex demo
Use
(?<=\bspeaks\b.*?)\b\w{4,}\b
See proof.
Explanation
--------------------------------------------------------------------------------
(?<=        look behind to see if there is:
  \b          the boundary between a word char (\w) and something that is not a word char
  speaks      'speaks'
  \b          the boundary between a word char (\w) and something that is not a word char
  .*?         any character except \n (0 or more times (matching the least amount possible))
)           end of look-behind
\b          the boundary between a word char (\w) and something that is not a word char
\w{4,}      word characters (a-z, A-Z, 0-9, _) (at least 4 times (matching the most amount possible))
\b          the boundary between a word char (\w) and something that is not a word char
Demo
const string = `Example Person speaks ExampleLanguage1, ExampleLanguage2, ExampleLanguage3 and ExampleLanguage4.
Example Person two speaks ExampleLanguage1, ExampleLanguage2, ExampleLanguage3 and ExampleLanguage4.
Example Person three speaks ExampleLanguage1 and ExampleLanguage2.`
console.log(string.match(/(?<=\bspeaks\b.*?)\b\w{4,}\b/gi))
The continue operator (\G) seems to be the right thing here. The accepted answer is fine, but it has a problem with 3-letter languages such as Yao, Min and Mon (spoken in Africa, Asia...).
Try something along these lines:
(?i)(?:speaks\s*\K|(?<!^)\G(?:,|,?\s*and)\s*\K)(?-i)([A-Z](?i)\w+)
Demo
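JavaScript's regex engine has neither \G nor \K, so the pattern above cannot be used there as written. As a swapped-in alternative (not the technique from this answer), here is a two-step sketch: capture everything after speaks, then split on commas and on the word and. The function name languagesAfterSpeaks and the sample sentence are hypothetical:

```javascript
// Two-step alternative for engines without \G and \K.
function languagesAfterSpeaks(line) {
  // Step 1: grab the rest of the line after the word "speaks".
  const m = /\bspeaks\s+(.*)/i.exec(line);
  if (!m) return [];
  // Step 2: drop the trailing period, then split on "," and " and ".
  return m[1]
    .replace(/\.\s*$/, "")
    .split(/\s*,\s*(?:and\s+)?|\s+and\s+/)
    .filter(Boolean);
}

console.log(languagesAfterSpeaks("Example Person speaks Yao, Min and Mon."));
// [ 'Yao', 'Min', 'Mon' ]
```

Since nothing here depends on word length, three-letter language names and names containing spaces come through unchanged.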

Match if the line has two or more of the same capitalized word

Basically I want to match this:
So this. So that. [this should match]
Yes this. No that. [this shouldn't match]
I thought this would work:
(\b(\w+)\1\b.*){2,}
But right now, it's matching the second line too: https://regexr.com/5jhag
Why is this and how to fix it?
Match if the line has two or more of the same capitalized word
As you want to match capitalized words only, a bare \w is not right, because it matches any of [a-zA-Z0-9_]. Also, using \1 immediately after the capture group only finds consecutive repeats. Finally, \b is required around both occurrences of the word.
You may use this regex:
\b([A-Z]\w*)\b.*\b\1\b
RegEx Demo
RegEx Details:
\b: Word boundary
([A-Z]\w*): Match a capitalized word, i.e. one that starts with an uppercase letter followed by 0 or more word characters
\b: Word boundary
.*: Match 0 or more of any characters
\b\1\b: Match same word as what we captured in group #1 surrounded with word boundaries
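A short JavaScript sketch that runs this regex against the two lines from the question, plus one extra made-up line that shows what the word boundaries around \1 are for:

```javascript
// Regex from the answer: a capitalized word that occurs again later on the line.
const repeatedCapitalized = /\b([A-Z]\w*)\b.*\b\1\b/;

console.log(repeatedCapitalized.test("So this. So that."));   // true
console.log(repeatedCapitalized.test("Yes this. No that."));  // false
// Hypothetical extra case: "So" inside "Sofa" does not count as a repeat.
console.log(repeatedCapitalized.test("So this. Sofa that.")); // false
```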
(\b(\w+)\1\b.*){2,} is a repeated capturing group. \1 is a backreference to the very group it sits inside, so at each iteration it refers to a group that has not finished matching yet and is treated as an empty string. Note: with the PCRE engine there would be no match at all (see proof), because there \1 is not empty but unset (null), and a backreference to an unset group fails.
Your regex matches Yes this. No that. because the current expression is effectively equivalent to (\b(\w+)\b.*){2,}, which matches any word, then any text, two times or more.
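This behaviour can be reproduced in JavaScript, where a backreference to a group that has not captured anything yet matches the empty string. The sketch below only assumes a standard JavaScript engine such as Node.js and uses the line from the question:

```javascript
// Original attempt: \1 sits inside group 1, so it is still unset when evaluated.
const original = /(\b(\w+)\1\b.*){2,}/;
// The same expression with the (empty) backreference removed.
const withoutBackref = /(\b(\w+)\b.*){2,}/;

const line = "Yes this. No that.";
console.log(original.test(line));       // true
console.log(withoutBackref.test(line)); // true
```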
Use
.*\b([A-Z][a-zA-Z]+)\b.*\b\1\b.*
See proof.
Unicode version:
.*\b(\p{Lu}\p{L}+)\b.*\b\1\b.*
See another proof.
Explanation
--------------------------------------------------------------------------------
.*          any character except \n (0 or more times (matching the most amount possible))
\b          the boundary between a word char (\w) and something that is not a word char
(           group and capture to \1:
  [A-Z]       any character of: 'A' to 'Z'
  [a-zA-Z]+   any character of: 'a' to 'z', 'A' to 'Z' (1 or more times (matching the most amount possible))
)           end of \1
\b          the boundary between a word char (\w) and something that is not a word char
.*          any character except \n (0 or more times (matching the most amount possible))
\b          the boundary between a word char (\w) and something that is not a word char
\1          what was matched by capture \1
\b          the boundary between a word char (\w) and something that is not a word char
.*          any character except \n (0 or more times (matching the most amount possible))

Regex multiple lookbehinds order

I've got a regex that I want to use to match word characters after a - if there are not any preceding word characters.
(?<!\w)(?<=-)\w+
For the string
I want -to match a word-if it has a '-' bofore it and only-if '-' is not preceded by a word character.
I would expect it to only match to. However, it actually matches to, if, and if.
Demo
If I take the positive lookbehind out
(?<!\w)-\w+
In the same string, it only matches -to as expected but I don't want the - in the match information.
Is it possible to chain positive and negative lookbehinds so they happen in order?
The pattern that you tried, (?<!\w)(?<=-)\w+, makes 2 assertions at the current position:
(?<!\w) is there not a word character directly to the left
(?<=-) is there a - directly to the left
This can also be written as just (?<=-)\w+: the positive lookbehind already requires a - directly to the left, and - is not a word character, so the negative lookbehind is always satisfied.
You get the matches to, if, and if because that assertion is true at multiple places.
You can use (?<=\W-) to assert what is directly to the left is a non word character \W followed by -
(?<=\W-)\w+
Regex demo
Use
(?<=\B-)\w+
See proof
Explanation
--------------------------------------------------------------------------------
(?<=        look behind to see if there is:
  \B          the boundary between two word chars (\w) or two non-word chars (\W)
  -           '-'
)           end of look-behind
\w+         word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
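The three lookbehind patterns from this thread can be compared side by side with a sketch like the one below, assuming a JavaScript engine with lookbehind support. The input is the string from the question. The short "-to" string is an extra, made-up case showing one place where \W- and \B- differ, namely a - at the very start of the input:

```javascript
const input = "I want -to match a word-if it has a '-' bofore it and only-if '-' is not preceded by a word character.";

// Both assertions look at the same position, so any "-" directly on the left qualifies.
console.log(input.match(/(?<!\w)(?<=-)\w+/g)); // [ 'to', 'if', 'if' ]

// A single lookbehind that checks the character before the "-" as well.
console.log(input.match(/(?<=\W-)\w+/g));      // [ 'to' ]
console.log(input.match(/(?<=\B-)\w+/g));      // [ 'to' ]

// \W needs an actual character before "-", \B does not.
console.log("-to".match(/(?<=\W-)\w+/g));      // null
console.log("-to".match(/(?<=\B-)\w+/g));      // [ 'to' ]
```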

ReGex, How to find second instance of string

If I want to get the name between "for" and ";", which is NISHER HOSE, can you help me find the correct regular expression? There is more than one "for" and ";" in the string:
Data Owner Approval Needed for Access Request #: 2137352 for NISHER HOSE; CONTRACTOR; Manager: MUILLER, TIM (TWM0069)
Using the regular expression (?<=for).*(?=;) I get the wrong match Access Request #: 2137352 for NISHER HOSE; CONTRACTOR - see screenshot on https://www.regextester.com/
Thanks
If you only want to assert for on the left, you should make sure not to match for again, and you should exclude ; from the match while asserting that there is one on the right.
(?<=\bfor )(?:(?!\bfor\b)[^;])+(?=;)
Explanation
(?<=\bfor ) Assert for at the left
(?:(?!\bfor\b)[^;])+ Match 1+ times any char except ;, as long as the current position is not directly followed by for surrounded by word boundaries
(?=;) Assert ; directly at the right
Regex demo
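Here is a small JavaScript sketch that applies this pattern to the string from the question, assuming an engine with lookbehind support. The variable names are made up:

```javascript
const subject = "Data Owner Approval Needed for Access Request #: 2137352 for NISHER HOSE; CONTRACTOR; Manager: MUILLER, TIM (TWM0069)";

// Each consumed character must not be ";" and must not start another "for".
const tempered = /(?<=\bfor )(?:(?!\bfor\b)[^;])+(?=;)/;

const found = subject.match(tempered);
console.log(found[0]); // NISHER HOSE
```

Starting after the first for does not work because the match would run into the second for before reaching a ;, so the engine moves on to the second one.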
Use
(?<=\bfor )(?![^;]*\bfor\b)[^;]+
See proof.
Explanation
--------------------------------------------------------------------------------
(?<=        look behind to see if there is:
  \b          the boundary between a word char (\w) and something that is not a word char
  for         'for '
)           end of look-behind
(?!         look ahead to see if there is not:
  [^;]*       any character except: ';' (0 or more times (matching the most amount possible))
  \b          the boundary between a word char (\w) and something that is not a word char
  for         'for'
  \b          the boundary between a word char (\w) and something that is not a word char
)           end of look-ahead
[^;]+       any character except: ';' (1 or more times (matching the most amount possible))
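The same check for this second pattern, again as a JavaScript sketch with made-up variable names. One difference worth noting is that this pattern does not require a ; after the name, which the last line illustrates with a hypothetical short input:

```javascript
const request = "Data Owner Approval Needed for Access Request #: 2137352 for NISHER HOSE; CONTRACTOR; Manager: MUILLER, TIM (TWM0069)";

// The negative lookahead rejects a start position if another "for" comes before the next ";".
const lastFor = /(?<=\bfor )(?![^;]*\bfor\b)[^;]+/;

console.log(request.match(lastFor)[0]); // NISHER HOSE
console.log("for X".match(lastFor)[0]); // X (no ";" needed)
```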
The main issue here is that there are two "for". If you want to catch the name then use the ":" as a delimiter to catch the second "for":
Regex: /:.*for(.+?);/gm
Demo: https://regex101.com/r/p3QY0o/1
The name will be captured in group 1. If you decide to use a lookahead/lookbehind just bear in mind that these may or may not be supported depending on the regex engine.
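A sketch of the capture-group approach in JavaScript, using the string from the question. Note that group 1 begins with the space that follows for, so the sketch trims it. The trim step is an addition for illustration and not part of the answer:

```javascript
const message = "Data Owner Approval Needed for Access Request #: 2137352 for NISHER HOSE; CONTRACTOR; Manager: MUILLER, TIM (TWM0069)";

// Greedy .* after the first ":" runs to the last "for" that is still followed by a ";".
const afterColon = /:.*for(.+?);/;

const result = afterColon.exec(message);
const personName = result[1].trim(); // group 1 is " NISHER HOSE" before trimming
console.log(personName);             // NISHER HOSE
```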