Vim regex to split string but keep the separators - regex

To my current understanding, the pattern below should work (expected ['bar', 'FOO', 'bar']), but only the first alternative is found (zero-width matches after FOO, but not before).
echo split('barFOObar', '\v(FOO\zs|\zeFOO)') " --> ['barFOO', 'bar']
Neither could I solve it with lookahead/lookbehind.
echo split('barFOObar', '\v((FOO)\#<=|(FOO)\#=)') " --> ['bar', 'bar']
Compare this with e.g. Python:
echo py3eval("re.split('(?=FOO)|(?<=FOO)', 'barFOObar')") " --> ['bar', 'FOO', 'bar']
(Note: in Python, a paren-enclosed '(FOO)' would also work for this.)
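For reference, a minimal Python sketch of both forms mentioned here. The helper name `split_keep` is hypothetical, and the lookaround variant assumes Python 3.7 or later, where `re.split` also splits on empty matches.

```python
import re

def split_keep(pattern, text):
    """Split text on pattern, keeping every separator as its own item."""
    # A capturing group around the pattern makes re.split return the separators too.
    return re.split('(' + pattern + ')', text)

# Capturing-group form
print(split_keep('FOO', 'barFOObar'))             # ['bar', 'FOO', 'bar']
# Lookaround form (zero-width matches before and after FOO)
print(re.split('(?=FOO)|(?<=FOO)', 'barFOObar'))  # ['bar', 'FOO', 'bar']
```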
Why don't the above examples in Vim's regex work the way I thought they should? (And also, is there a more or less straightforward way to do this in pure Vimscript then?)

There doesn't seem to be a way to accomplish that direct result using a single split(). In fact, the docs for split() mention this particular situation of preserving the delimiter, with:
If you want to keep the separator you can also use \zs at the end of the pattern:
:echo split('abc:def:ghi', ':\zs')
['abc:', 'def:', 'ghi']
Having said that, using both a lookahead and a lookbehind does actually work. In your example, you have a syntax error. Since you're using verymagic mode, you shouldn't escape #, since it's already special. (Thanks @user938271 for pointing that out!)
This works:
:echo split('barFOObar', '\v((FOO)#<=|(FOO)#=)')
" --> ['bar', 'FOO', 'bar']
Regarding using the markers for \zs and \ze:
:echo split('barFOObar', '\v(FOO\zs|\zeFOO)')
" --> ['barFOO', 'bar']
So the first trouble you have here is that both expressions on each side of the | are matching the same text "FOO", so since they're identical, the first wins and you get it on the left side.
Change order and you get it on the right side:
:echo split('barFOObar', '\v(\zeFOO|FOO\zs)')
" --> ['bar', 'FOObar']
Now the question is why the second token "FOObar" isn't being split since it's matching again (the lookbehind case splits this one, right?)
Well, the answer is that it's actually being split again, but it matches on the first case of \zeFOO one more time and produces a split with the empty string. You can see that by passing a keepempty argument:
:echo split('barFOObar', '\v(\zeFOO|FOO\zs)', 1)
" --> ['bar', '', 'FOObar']
One question still unanswered here is why the lookahead/lookbehind does work, while the \zs and \ze doesn't. I think I addressed that somehow in this answer to regex usage in syntax groups.
This won't work because Vim won't scan the same text twice trying to match a different regex.
Even though the \zs makes the resulting match only include bar, Vim needs to consume FOO to be able to match that regex and it won't do so if it already matched it with the other half of the pattern.
A lookbehind with \#<= is different. The reason it works is that Vim will first search for bar (or whatever text it's considering) and then look behind to see if FOO also matches. So the pattern gets anchored on bar rather than FOO and doesn't suffer from the issue of trying to start a match on a region that already matched another expression.
You can easily visualize that difference by performing a search with Vim. Try this one:
/\v(\zeFOO|FOO\zs)
And compare it with this one:
/\v((FOO)#<=|(FOO)#=)
You'll notice the latter one will match both before and after FOO, while the former won't.
Compare this with e.g. Python [re.split] ...
in Python, a paren-enclosed '(FOO)' would also work for this.
Vim's and Python's regex engines are different beasts...
Many of the limitations in Vim's engine come from its ancestry from vi. One particular limitation is capture groups, where you're limited to 9 of them and there's no way around that.
Given that limitation, you'll find that capture groups are typically used less often (and, when used, they're less powerful) than in Python.
One option to consider is to use Python in Vim instead of Vimscript. Although typically that impacts portability, so personally I wouldn't switch for this feature alone.
is there a more or less straightforward way to do this in pure Vimscript then?
One option is to reimplement a version of split() that preserves delimiters, using matchstrpos(). For example:
function! SplitDelim(expr, pat)
  let result = []
  let expr = a:expr
  while 1
    let [w, s, e] = matchstrpos(expr, a:pat)
    if s == -1
      break
    endif
    call add(result, s ? expr[:s-1] : '')
    call add(result, w)
    let expr = expr[e:]
  endwhile
  call add(result, expr)
  return result
endfunction
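For comparison, the same loop can be sketched in Python, the language the question compares against. This is only an illustration of the algorithm, not a Vimscript replacement: `split_delim` is a hypothetical name, and the guard against an empty match at the start of the remaining text is an addition that the function above does not have.

```python
import re

def split_delim(text, pattern):
    """Split text on pattern, returning the pieces and the separators in order."""
    regex = re.compile(pattern)
    result = []
    rest = text
    while True:
        m = regex.search(rest)
        # Stop when nothing matches, or when an empty match at position 0
        # would keep the loop from making progress.
        if m is None or m.end() == 0:
            break
        result.append(rest[:m.start()])  # text before the separator (may be '')
        result.append(m.group(0))        # the separator itself
        rest = rest[m.end():]
    result.append(rest)
    return result

print(split_delim('barFOObarFOObaz', 'FOO'))  # ['bar', 'FOO', 'bar', 'FOO', 'baz']
```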

You could first replace FOO with -FOO-, then split the string. For example:
:echo split(substitute('barFOObarFOObaz', 'FOO','-&-','g'),'-')
['bar', 'FOO', 'bar', 'FOO', 'baz']
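Note that this trick relies on the marker character never appearing in the input. A rough Python sketch of the same idea, assuming a NUL character as the marker; `split_via_marker` is a hypothetical helper, and unlike Vim's split() it drops every empty item, not only the first and last.

```python
def split_via_marker(text, sep, marker='\x00'):
    """Wrap every separator in a marker, then split on the marker."""
    if marker in text:
        # The approach breaks down if the marker already occurs in the input.
        raise ValueError('marker occurs in the input text')
    marked = text.replace(sep, marker + sep + marker)
    # Adjacent separators produce empty pieces; they are dropped here.
    return [piece for piece in marked.split(marker) if piece]

print(split_via_marker('barFOObarFOObaz', 'FOO'))  # ['bar', 'FOO', 'bar', 'FOO', 'baz']
```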

Related

Capturing what's inside a nested structure in a regex or grammar token

I'd like to capture the interior of a nested structure.
my $str = "(a)";
say $str ~~ /"(" ~ ")" (\w) /;
say $str ~~ /"(" ~ ")" <(\w)> /;
say $str ~~ /"(" <(~)> ")" \w /;
say $str ~~ /"(" <(~ ")" \w /;
The first one works; the last one works but also captures the closing parenthesis. The other two fail, so it's not possible to use capture markers in this case. But the problem is more complicated in the context of a grammar, since capturing groups do not seem to work either, like here:
# Please paste this together with the code above so that it compiles.
grammar G {
    token TOP {
        '(' ~ ')' $<content> = .+?
    }
}
grammar H {
    token TOP {
        '(' ~ ')' (.+?)
    }
}
grammar I {
    token TOP {
        '(' ~ ')' <( .+? )>
    }
}
$str = "(one of us)";
for G,H,I -> $grammar {
    say $grammar.parse( $str );
}
Neither capturing groups nor capture markers seem to work, unless the match is assigned on the fly to a variable. This, however, creates an additional token I'd really like to avoid.
So there are two questions
What is the right way to make capture markers work in nested structures?
Is there a way to use either capturing groups or capturing markers in tokens to get the interior of a nested structure?
One solution to two issues
Per ugexe's comment, the [...] grouping construct works for all your use cases.
The <( and )> capture markers are not grouping constructs so they don't work with the regex ~ operation unless they're grouped.
The (...) capture/grouping construct clamps frugal matching to its minimum match when ratchet is in effect. A pattern like :r (.+?) never matches more than one character.
The behaviors described in the last two bullet points above aren't obvious, aren't in the docs, may not be per the design docs, may be holes in roast, may be figments of my imagination, etc. The rest of this answer explains what I've found out about the above three cases, and discusses some things that could be done.
Glib explanation, as if it's all perfectly cromulent
<( and )> are capture markers.
They behave as zero width assertions. Each asserts "this marks where I want capturing to start/end for the regex that contains this marker".
Per the doc for the regex ~ operator:
it mostly ignores the left argument, and operates on the next two [arguments]
(The doc says "atoms" where I've written "arguments". In reality it operates on the next two atoms or groups.)
In the regex pattern "(" ~ ")" <(\w)>:
")" is the first atom/group after ~.
<( is the second atom/group after ~.
~ ignores \w)>.
The solution is to use [...]:
say '(a)' ~~ / '(' ~ ')' [ <( \w )> ] /; # 「a」
Similarly, in a grammar:
token TOP { '(' ~ ')' [ <( .+? )> ] }
(...) grouping isn't what you want for two reasons:
It couldn't be what you want. It would create an additional token capture. And you wrote you'd like to avoid that.
Even if you wanted the additional capture, using (...) when ratchet is in effect clamps frugal matching within the parens.
What could be done about capture markers "not working"?
I think a doc update is the likely best thing to do. But imo whoever thinks of filing an issue about one, or preparing a PR, would be well advised to make use of the following.
Is it known to be intended behavior or a bug?
Searches of GH repos for "capture markers":
raku/old-design-docs
raku/roast
raku/old-issue-tracker and rakudo/rakudo
raku/docs
The term "capture markers" comes from the doc, not the old design docs which just say:
A <( token indicates the start of the match's overall capture, while the corresponding )> token indicates its endpoint. When matched, these behave as assertions that are always true, but have the side effect of setting the .from and .to attributes of the match object.
(Maybe you can figure out from that what strings to search for among issues etc...)
At the time of writing, all GH searches for <( or )> draw blanks but that's due to a weakness of the current built in GH search, not because there aren't any in those repos, eg this.
I was curious and tried this:
my $str = "aaa";
say $str ~~ / <(...)>* /;
It infinitely loops. The * is acting on just the )>. This corroborates the sense that capture markers are treated as atoms.
The regex ~ operator works for [...] and some other grouped atom constructions. Parsing any of them has a start and end within a regex pattern.
The capture markers are different in that they aren't necessarily paired -- the start or end can be implicit.
Perhaps this makes treating them as we might wish unreasonably difficult for Raku given that start (/ or {) and end (/ or }) occur at a slang boundary and Raku is a single-pass parsing braid?
I think that a doc fix is probably the appropriate response to this capture marker aspect of your SO.
If regex ~ were the only regex construct that cared that left and right capture markers are each an individual atom then perhaps the best place to mention this wrinkle would be in the regex ~ section.
But given that multiple regex constructs care (quantifiers do per the above infinite loop example), then perhaps the best place is the capture markers section.
Or perhaps it would be best if it's mentioned in both. (Though that's a slippery slope...)
What could be done about :r (.*?) "not working"?
I think a doc update is the likely best thing to do. But imo whoever thinks of filing an issue about one, or preparing a PR, would be well advised to make use of the following.
Is it known to be intended behavior or a bug?
Searches of GH repos for ratchet frugal:
raku/old-design-docs
raku/roast
raku/old-issue-tracker and rakudo/rakudo
raku/docs
The terms "ratchet" and "frugal" both come from the old design docs and are still used in the latest doc and don't seem to have aliases. So searches for them should hopefully match all relevant mentions.
The above searches are for both words. Searching for one at a time may reveal important relevant mentions that happen to not mention the other.
At the time of writing, all GH searches for .*? or similar draw blanks but that's due to a weakness of the current built in GH search, not because there aren't any in those repos.
Perhaps the issue here is broader than the combination of ratchet, frugal, and capture?
Perhaps file an issue using the words "ratchet", "frugal" and "capture"?

Regex: match string unless it contains a word [duplicate]

I know that I can negate group of chars as in [^bar] but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual bar, and not "any chars in bar"?
A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
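A minimal Python sketch of this pattern in use (the name `lacks_bar` is made up for the example):

```python
import re

# The lookahead is checked at the start of the line: it fails if "bar"
# can be found anywhere after that point.
NO_BAR = re.compile(r'^(?!.*bar).*$')

def lacks_bar(line):
    return NO_BAR.match(line) is not None

for line in ['foo', 'foobarbaz', 'ba r']:
    print(repr(line), lacks_bar(line))
```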
Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.
Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (rejected, as desired)
xxxSTRING2xxx KO (rejected, as desired)
xxxSTRING3xxx KO (rejected, as desired)
You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]|b(?:b|ab)*(?:[^ab]|a[^br]))*(?:b(?:b|ab)*a?)?$
These all match anything that does not contain bar.
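One way to gain confidence in patterns like these is to compare them with a plain substring test. The sketch below does that in Python for the two lookaround patterns; the sample strings and the function name are chosen for illustration only.

```python
import re

LOOKAROUND_PATTERNS = [
    r'^(?!.*?bar).*',    # negative lookahead
    r'^(.(?<!bar))*?$',  # negative lookbehind after every character
]
SAMPLES = ['', 'foo', 'bar', 'ba', 'xbarx', 'babar', 'rab']

def agrees_with_substring_test(pattern):
    """True if the regex accepts exactly the samples that do not contain 'bar'."""
    regex = re.compile(pattern)
    return all((regex.search(s) is not None) == ('bar' not in s) for s in SAMPLES)

for pattern in LOOKAROUND_PATTERNS:
    print(pattern, agrees_with_substring_test(pattern))
```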
The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]
I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and it has at least one other character, or if the string does not start with 'bar'."
The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exists. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.
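The same two-pass idea can be sketched in Python: one regex selects, a second one rejects. The `grep` helper and the sample lines are assumptions made for the example.

```python
import re

def grep(lines, pattern, invert=False):
    """Keep the lines that match pattern, or the ones that don't if invert is set."""
    regex = re.compile(pattern)
    return [line for line in lines if bool(regex.search(line)) != invert]

lines = ['foo bar', 'foo baz', 'qux']
# First pass: what I want. Second pass: drop the ones I don't.
wanted = grep(grep(lines, 'foo'), 'bar', invert=True)
print(wanted)  # ['foo baz']
```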
If it's truly the word bar that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar that is on a word boundary, that is to say, separated from non-word characters. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
(?s)^(?!.*\bbar\b).*$
Alternatively:
^(?!.*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
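In Python the flag can be passed to the compiler instead of being written inline. A small sketch, assuming `re.DOTALL` as "the correct regex flag" and a made-up helper name:

```python
import re

# re.DOTALL lets "." match newlines, so the lookahead sees the whole text.
NO_WORD_BAR = re.compile(r'^(?!.*\bbar\b).*$', re.DOTALL)

def lacks_word_bar(text):
    return NO_WORD_BAR.match(text) is not None

print(lacks_word_bar('barrier\nfoo'))   # True: "bar" only occurs inside a longer word
print(lacks_word_bar('foo\nbar baz'))   # False: the word "bar" is on the second line
```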
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo
Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*
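When the list of excluded words is not fixed, the pattern can be built from it. A minimal Python sketch, assuming the words should be treated literally (hence `re.escape`); `not_exactly` is a hypothetical helper name.

```python
import re

def not_exactly(words):
    """Build a regex that matches any string except the exact words given."""
    alternatives = '|'.join(re.escape(w) for w in words)
    return re.compile(r'^(?!(?:' + alternatives + r')$).*')

regex = not_exactly(['bar', 'foo', 'banana'])
for candidate in ['bar', 'barrier', 'disbar', 'banana']:
    print(candidate, regex.match(candidate) is not None)
```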
I wish to complement the accepted answer and contribute to the discussion with my late answer.
@ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s'],
}
Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.
I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
  'mydir/states.rb',      # don't match these
  'countries.rb',
  'mydir/states_bkp.rb',  # match these
  'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.
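The same pattern carries over to other flavors. Here is a rough Python port, with a plain list comprehension standing in for WankyAPI.filter (whose behavior is assumed to be "keep the entries that match"):

```python
import re

files = [
    'mydir/states.rb',       # should be excluded
    'countries.rb',          # should be excluded
    'mydir/states_bkp.rb',   # should be kept
    'mydir/city_states.rb',  # should be kept
]
excluded = ['states', 'countries']

# Each excluded name must be followed by a dot, as in the Ruby version.
excluded_rgx = '|'.join(re.escape(e) + r'\.' for e in excluded)
my_rgx = re.compile(r'(^|/)((?!' + excluded_rgx + r')[^./]*)\.rb$')

result = [f for f in files if my_rgx.search(f)]
print(result)  # ['mydir/states_bkp.rb', 'mydir/city_states.rb']
```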

replace word using Regex, but not in Quotes in C# [duplicate]

From this q/a, I deduced that matching all instances of a given regex not inside quotes, is impossible. That is, it can't match escaped quotes (ex: "this whole \"match\" should be taken"). If there is a way to do it that I don't know about, that would solve my problem.
If not, however, I'd like to know if there is any efficient alternative that could be used in JavaScript. I've thought about it a bit, but can't come up with any elegant solutions that would work in most, if not all, cases.
Specifically, I just need the alternative to work with .split() and .replace() methods, but if it could be more generalized, that would be the best.
For Example:
An input string of: +bar+baz"not+or\"+or+\"this+"foo+bar+
replacing + with #, not inside quotes, would return: #bar#baz"not+or\"+or+\"this+"foo#bar#
Actually, you can match all instances of a regex not inside quotes for any string in which each opening quote is closed again. Say, as in your example above, you want to match \+.
The key observation here is that a word is outside quotes if there is an even number of quotes following it. This can be modeled as a look-ahead assertion:
\+(?=([^"]*"[^"]*")*[^"]*$)
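As a quick illustration, the same look-ahead can be tried with Python's re module, which accepts this pattern unchanged. The sample string has no escaped quotes, and the function name is made up for the example.

```python
import re

# A "+" is outside quotes if an even number of quotes follows it.
OUTSIDE_QUOTES = re.compile(r'\+(?=([^"]*"[^"]*")*[^"]*$)')

def replace_plus_outside_quotes(text, replacement='#'):
    return OUTSIDE_QUOTES.sub(replacement, text)

print(replace_plus_outside_quotes('+bar+baz"not+these+"foo+bar+'))
# #bar#baz"not+these+"foo#bar#
```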
Now, you'd like to not count escaped quotes. This gets a little more complicated. Instead of [^"]*, which advances to the next quote, you need to consider backslashes as well and use [^"\\]*. After you arrive at either a backslash or a quote, you need to ignore the next character if you encounter a backslash, or else advance to the next unescaped quote. That looks like (\\.|"([^"\\]*\\.)*[^"\\]*"). Combined, you arrive at
\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
I admit it is a little cryptic. =)
Azmisov, resurrecting this question because you said you were looking for any efficient alternative that could be used in JavaScript and any elegant solutions that would work in most, if not all, cases.
There happens to be a simple, general solution that wasn't mentioned.
Compared with alternatives, the regex for this solution is amazingly simple:
"[^"]+"|(\+)
The idea is that we match but ignore anything within quotes to neutralize that content (on the left side of the alternation). On the right side, we capture all the + that were not neutralized into Group 1, and the replace function examines Group 1. Here is full working code:
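<script>
var subject = '+bar+baz"not+these+"foo+bar+';
var regex = /"[^"]+"|(\+)/g;
// Quoted strings match the left branch and are returned unchanged;
// only a + captured in Group 1 is replaced.
var replaced = subject.replace(regex, function(m, group1) {
    if (!group1) return m;
    else return "#";
});
document.write(replaced);
</script>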
Online demo
You can use the same principle to match or split. See the question and article in the reference, which will also point you to code samples.
Hope this gives you a different idea of a very general way to do this. :)
What about Empty Strings?
The above is a general answer to showcase the technique. It can be tweaked depending on your exact needs. If you worry that your text might contain empty strings, just change the quantifier inside the string-capture expression from + to *:
"[^"]*"|(\+)
See demo.
What about Escaped Quotes?
Again, the above is a general answer to showcase the technique. Not only can the "ignore this match" regex be refined to your needs, you can also add multiple expressions to ignore. For instance, if you want to make sure escaped quotes are adequately ignored, you can start by adding an alternation \\"| in front of the other two in order to match (and ignore) straggling escaped double quotes.
Next, within the section "[^"]*" that captures the content of double-quoted strings, you can add an alternation to ensure escaped double quotes are matched before their " has a chance to turn into a closing sentinel, turning it into "(?:\\"|[^"])*"
The resulting expression has three branches:
\\" to match and ignore
"(?:\\"|[^"])*" to match and ignore
(\+) to match, capture and handle
Note that in other regex flavors, we could do this job more easily with lookbehind, but JS didn't support it when this was written (lookbehind was added to JavaScript in ES2018).
The full regex becomes:
\\"|"(?:\\"|[^"])*"|(\+)
See regex demo and full script.
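The technique is not specific to JavaScript. A minimal Python sketch of the three-branch regex, run on the input from the question (the function name is hypothetical):

```python
import re

# Branch 1: a stray escaped quote (matched and ignored)
# Branch 2: a whole quoted string, allowing \" inside (matched and ignored)
# Branch 3: a "+" outside quotes, captured in group 1
PATTERN = re.compile(r'\\"|"(?:\\"|[^"])*"|(\+)')

def replace_outside_quotes(text, replacement='#'):
    # Only matches that set group 1 are replaced; everything else is put back as-is.
    return PATTERN.sub(lambda m: replacement if m.group(1) else m.group(0), text)

print(replace_outside_quotes(r'+bar+baz"not+or\"+or+\"this+"foo+bar+'))
# #bar#baz"not+or\"+or+\"this+"foo#bar#
```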
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
You can do it in three steps.
Use a regex global replace to extract all string body contents into a side-table.
Do your comma translation
Use a regex global replace to swap the string bodies back
Code below
// Step 1
var sideTable = [];
myString = myString.replace(
    /"(?:[^"\\]|\\.)*"/g,
    function (_) {
      var index = sideTable.length;
      sideTable[index] = _;
      return '"' + index + '"';
    });
// Step 2, replace commas with newlines
myString = myString.replace(/,/g, "\n");
// Step 3, swap the string bodies back
myString = myString.replace(/"(\d+)"/g,
    function (_, index) {
      return sideTable[index];
    });
If you run that after setting
myString = '{:a "ab,cd, efg", :b "ab,def, egf,", :c "Conjecture"}';
you should get
{:a "ab,cd, efg"
:b "ab,def, egf,"
:c "Conjecture"}
It works, because after step 1,
myString = '{:a "0", :b "1", :c "2"}'
sideTable = ["ab,cd, efg", "ab,def, egf,", "Conjecture"];
so the only commas in myString are outside strings. Step 2, then turns commas into newlines:
myString = '{:a "0"\n :b "1"\n :c "2"}'
Finally we replace the strings that only contain numbers with their original content.
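The three steps can be sketched the same way in Python. This is an illustrative port rather than a drop-in replacement; the function name and its parameters are assumptions.

```python
import re

def replace_outside_strings(text, old, new):
    """Replace old with new everywhere except inside double-quoted strings."""
    side_table = []

    def stash(match):
        # Step 1: move each quoted string into the side table, leave "<index>" behind.
        side_table.append(match.group(0))
        return '"%d"' % (len(side_table) - 1)

    masked = re.sub(r'"(?:[^"\\]|\\.)*"', stash, text)
    # Step 2: the only remaining occurrences of old are outside strings.
    masked = masked.replace(old, new)
    # Step 3: swap the original strings back in.
    return re.sub(r'"(\d+)"', lambda m: side_table[int(m.group(1))], masked)

my_string = '{:a "ab,cd, efg", :b "ab,def, egf,", :c "Conjecture"}'
print(replace_outside_strings(my_string, ',', '\n'))
```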
Although the answer by zx81 seems to be the best performing and cleanest one, it needs these fixes to correctly catch the escaped quotes:
var subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
and
var regex = /"(?:[^"\\]|\\.)*"|(\+)/g;
Also the already mentioned "group1 === undefined" or "!group1".
The second fix (the regex) seems especially important for covering everything asked in the original question.
It should be mentioned though that this method implicitly requires the string to not have escaped quotes outside of unescaped quote pairs.

Search text with a regular expression to match outside specific characters

I have text that looks like:
My name is (Richard) and I cannot do
[whatever (Jack) can't do] and
(Robert) is the same way [unlike
(Betty)] thanks (Jill)
The goal is to search using a regular expression to find all parenthesized names that occur anywhere in the text BUT in-between any brackets.
So in the text above, the result I am looking for is:
Richard
Robert
Jill
You can do it in two steps:
step1: match all bracket contents using:
\[[^\]]*\]
and replace it with ''
step2: match all the remaining parenthesized names(globally) using:
\([^)]*\)
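A short Python sketch of these two steps, using the text from the question. The function name is made up, and the second regex captures the name without its parentheses.

```python
import re

def names_outside_brackets(text):
    # Step 1: drop everything between square brackets.
    without_brackets = re.sub(r'\[[^\]]*\]', '', text)
    # Step 2: collect what is left inside parentheses.
    return re.findall(r'\(([^)]*)\)', without_brackets)

text = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] and "
        "(Robert) is the same way [unlike (Betty)] thanks (Jill)")
print(names_outside_brackets(text))  # ['Richard', 'Robert', 'Jill']
```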
You didn't say what language you're using, so here's some Python:
>>> import re
>>> REGEX = re.compile(r'(?:[^\[(]+|\(([^)]*)\)|\[[^]]*])')
>>> s="""My name is (Richard) and I cannot do [whatever (Jack) can't do] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"""
>>> list(filter(None, REGEX.findall(s)))
The output is:
['Richard', 'Robert', 'Jill']
One caveat is that this does not work with arbitrary nesting. The only nesting it's really designed to work with is one level of parens in square brackets as mentioned in the question. Arbitrary nesting can't be done with just regular expressions. (This is a consequence of the pumping lemma for regular languages.)
The regex looks for chunks of text without brackets or parens, chunks of text enclosed in parens, and chunks of text enclosed in brackets. Only text in parens (not in square brackets) is captured. Python's findall finds all matches of the regex in sequence. In some languages you may need to write a loop to repeatedly match. For non-paren matches, findall inserts an empty string in the result list, so the call to filter removes those.
If you are using .NET you can do something like:
"(?<!\[.*?)(?<name>\(\w+\))(?>!.*\])"
It's not really the best job for a single regexp - have you considered, for example, making a copy of the string and then deleting everything in between the square brackets instead? It would then be fairly straightforward to extract things from inside the parentheses. Alternatively, you could write a very basic parser that tokenises the line (into normal text, square bracket text, and parenthesised text, I imagine) and then parses the tree that produces; it'd be more work initially but would make life much simpler if you later want to make the behaviour any more complicated.
Having said that, /(?:(?:^|\])[^\[]*)\((.*?)\)/ does the trick for your test case (but it will almost certainly have some weird behaviour if your [ and ] aren't matched properly, and I'm not convinced it's that efficient).
A quick (PHP) test case:
preg_match_all('/(?:(?:^|\])[^\[]*)\((.*?)\)/', "My name is ... (Jill)", $m);
print(implode(", ", $m[1]));
Outputs:
Richard, Robert, Jill
>>> s="My name is (Richard) and I cannot do [whatever (Jack) can't do (Jill) can] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"
>>> for item in s.split("]"):
...     st = item.split("[")[0]
...     if ")" in st:
...         for i in st.split(")"):
...             if "(" in i:
...                 print(i.split("(")[-1])
...
Richard
Robert
Jill
So you want the regex to match the name, but not the enclosing parentheses? This should do it:
[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)
As with the other answers, I'm making certain assumptions about your target string, like expecting parentheses and square brackets to be correctly balanced and not nested.
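For what it's worth, here is how that pattern behaves in Python's re module on the text from the question. This is only one flavor, and the helper name is hypothetical.

```python
import re

# A run of non-paren characters that is followed by ")" and then only by
# text in which every "[" ... "]" pair is balanced up to the end of the string.
NAME_OUTSIDE_BRACKETS = re.compile(
    r'[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)'
)

def find_names(text):
    return NAME_OUTSIDE_BRACKETS.findall(text)

text = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] and "
        "(Robert) is the same way [unlike (Betty)] thanks (Jill)")
print(find_names(text))  # ['Richard', 'Robert', 'Jill']
```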
I say it should work because, although I've tested it, I don't know what language/tool you're using to do the regex matching with. We could provide higher-quality answers if we had that info; all regex flavors are not created equal.
