I'm trying to create a regex to check the number of unique users.
In this case, 3 different users in 1 string means it's valid.
Let's say we have the following string
lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven
It contains the domain for each user (lab) and their first name.
Each user is separated by ;
The goal is to have at least 3 unique users in a string.
In this case, the string is valid because we have the following unique users
simon, lieven, tim, davy = valid
If we take this string
lab\simon;lab\lieven;lab\simon
It's invalid because we only have 2 unique users
simon, lieven = invalid
So far, I've only come up with the following regex but I don't know how to continue
/(lab)\\(?:[a-zA-Z]*)/g
Could you help me with this regex?
Please let me know if you need more information if it's not clear.
What you are after cannot be achieved through regular expressions on their own. Regular expressions are meant for parsing information, not for processing it.
There is no particular pattern you are after, which is what regular expressions excel at. You will need to split by ; and use a data structure such as a set to store your string values.
Is this what you want:
1) Using regular expression:
import re
s = r'lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven'
# capture the name that follows each "lab\"
pattern = re.compile(r'lab\\([A-Za-z]+)')
users = pattern.findall(s)
# valid when at least 3 distinct names are present (repeats are allowed)
if len(set(users)) >= 3:
    print('Valid')
else:
    print('Invalid')
2) Without using regular expression:
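s = r'lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven'
# keep the part after the last backslash of every ;-separated entry
users = [i.split('\\')[-1] for i in s.split(';')]
# valid when at least 3 distinct names are present (repeats are allowed)
if len(set(users)) >= 3:
    print('Valid')
else:
    print('Invalid')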
In order to have a successful match, we need at least 3 sets of lab\user (this pattern only counts entries; it does not check that the names are distinct), i.e.:
(?:\\?lab\\[\w]+(?:;|$)){3}
You didn't specify your engine, but with Python you can use:
import re
string = r"lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven"
if re.search(r"(?:\\?lab\\[\w]+(?:;|$)){3}", string):
    print("Successful match")
else:
    print("Match attempt failed")
Regex Demo
Regex Explanation
(?:\\?lab\\[\w]+(?:;|$)){3}
Match the regular expression «(?:\\?lab\\[\w]+(?:;|$)){3}»
Exactly 3 times «{3}»
Match the backslash character «\\?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character string “lab” literally «lab»
Match the backslash character «\\»
Match a single character that is a “word character” «[\w]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regular expression below «(?:;|$)»
Match this alternative «;»
Match the character “;” literally «;»
Or match this alternative «$»
Assert position at the end of a line «$»
Here is a beginner-friendly way to solve your problem:
You should .split() the string per each "lab" section and declare the result as the array variable, like splitted_string.
Declare a second empty array to save each unique name, like unique_names.
Use a for loop to iterate through the splitted_string array. Check for unique names: if it isn't in your array of unique_names, add the name to unique_names.
Find the length of your array of unique_names to see if it is at least 3. If yes, print that it is. If not, then print a fail message.
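The steps above could look roughly like this in Python. This is only a minimal sketch: the function name count_unique_users is made up for illustration, and splitting on ';' and keeping the part after the last backslash is my assumption about how each name is extracted.

```python
def count_unique_users(line):
    """Return the number of distinct user names in a ';'-separated string."""
    splitted_string = line.split(';')
    unique_names = []
    for entry in splitted_string:
        # keep only what follows the last backslash of the entry
        name = entry.split('\\')[-1]
        if name and name not in unique_names:
            unique_names.append(name)
    return len(unique_names)


for sample in (r'lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven',
               r'lab\simon;lab\lieven;lab\simon'):
    verdict = 'valid' if count_unique_users(sample) >= 3 else 'invalid'
    print(sample, '->', verdict)
```

A list is used here to mirror the steps literally; a set would do the same job with less code.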
You seem like a practical person that is relatively new to string manipulation. Maybe you would enjoy some practical background reading on string manipulation at beginner sites like Automate The Boring Stuff With Python:
https://automatetheboringstuff.com/chapter6/
Or Codecademy, etc.
Another pure regex answer, for the sport. As others said, you should probably not be doing this.
^([^;]+)(;\1)*;((?!\1)[^;]+)(;(\1|\3))*;((?!\1|\3)[^;]+)
Explanation :
^                   from the start of the string
([^;]+)             we catch everything that isn't a ';'.
                    that's our first user, and our first capturing group
(;\1)*              it could be repeated
;((?!\1)[^;]+)      but at some point, we want to capture everything that isn't either
                    our first user nor a ';'. That's our second user,
                    and our third capturing group
(;(\1|\3))*         both the first and second user can be repeated now
;((?!\1|\3)[^;]+)   but at some point, we want to capture yada yada,
                    our third user and sixth capturing group
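For anyone who wants to try the pattern, here is a small check using Python's re module (my choice of engine; the answer does not name one). The helper name has_three_users is hypothetical. One limit to be aware of: the lookaheads compare prefixes, so an entry that merely starts with an earlier entry (lab\simon followed later by lab\simonx) is treated as a repeat.

```python
import re

# the pattern from the answer, unchanged
THREE_USERS = re.compile(
    r'^([^;]+)(;\1)*;((?!\1)[^;]+)(;(\1|\3))*;((?!\1|\3)[^;]+)'
)


def has_three_users(line):
    """True when the pattern finds three different ';'-separated entries."""
    return THREE_USERS.search(line) is not None


print(has_three_users(r'lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven'))
print(has_three_users(r'lab\simon;lab\lieven;lab\simon'))
```

With the two strings from the question this prints True and then False.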
This can be done with a simple regex.
Uses a conditional for each user name slot so that the required
three names are obtained.
Note that since the three slots are in a loop, the conditional guarantees the
capture group is not overwritten (which would invalidate the assertion test
(?! \1 | \2 | \3 ) mentioned below).
There is a complication. Each user name uses the same regex [a-zA-Z]+
so to accommodate that, a function is defined to check that the slot
has not been matched before.
This uses the Boost engine, which cosmetically requires that a group be
defined before it is back-referenced.
The workaround is to define a function at the bottom, after the groups are defined.
In Perl (and some other engines) a group does not have to be defined
before it is back-referenced, so you could do away with the function
and put
(?! \1 | \2 | \3 ) # Cannot have seen this user
[a-zA-Z]+
in the capture groups on top.
At a minimum, this requires conditionals.
Formatted and tested:
# (?:(?:.*?\blab\\(?:((?(1)(?!))(?&GetUser))|((?(2)(?!))(?&GetUser))|((?(3)(?!))(?&GetUser))))){3}(?(DEFINE)(?<GetUser>(?!\1|\2|\3)[a-zA-Z]+))
# Look for 3 unique users
(?:
     (?:
          .*?
          \b lab \\
          (?:
               (                        # (1), User 1
                    (?(1) (?!) )
                    (?&GetUser)
               )
            |  (                        # (2), User 2
                    (?(2) (?!) )
                    (?&GetUser)
               )
            |  (                        # (3), User 3
                    (?(3) (?!) )
                    (?&GetUser)
               )
          )
     )
){3}
(?(DEFINE)
     (?<GetUser>                        # (4)
          (?! \1 | \2 | \3 )            # Cannot have seen this user
          [a-zA-Z]+
     )
)
I have 3 values that I'm trying to match. foo, bar and 123. However I would like to match them only if they can be matched twice.
In the following line:
foo;bar;123;foo;123;
Since bar is not present twice, it would only match foo and 123 in:
foo;bar;123;foo;123;
I understand how to specify to match exactly two matches, (foo|bar|123){2} however I need to use backreferences in order to make it work in my example.
I'm struggling putting the two concepts together and making a working solution for this.
You could use
(?<=^|;)([^\n;]+)(?=.*(?:(?<=^|;)\1(?=;|$)))
Broken down, this is
(?<=^|;)          # pos. lookbehind, either start of string or ;
([^\n;]+)         # not ; nor newline 1+ times
(?=.*             # pos. lookahead
    (?:
        (?<=^|;)  # same pattern as above
        \1        # group 1
        (?=;|$)   # end or ;
    )
)
A shorter variant relies on word boundaries instead:
\b                # word boundary
([^;]+)           # anything not ; 1+ times
\b                # another word boundary
(?=.*\1)          # pos. lookahead, making sure the pattern is found again
See a demo on regex101.com.
Otherwise - as said in the comments - split on the ; programmatically and use some programming logic afterwards.
Find a demo in Python for example (can be adjusted for other languages as well):
from collections import Counter
string = """
foo;bar;123;foo;123;
foo;bar;foo;bar;
foo;foo;foo;bar;bar;
"""
twins = [element
         for line in string.split("\n")
         for element, times in Counter(line.split(";")).most_common()
         if times == 2]
print(twins)
Making sure to allow room for text that may occur in between matches with a ".*", this should match any of your values that occur at least twice:
(foo|bar|123).*\1
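A quick way to see what this pattern returns is to run it in Python (my choice of engine, not stated in the answer). The second pattern below, which moves the `.*\1` part into a lookahead, is my own assumption and is added only for comparison.

```python
import re

line = "foo;bar;123;foo;123;"

# Consuming version: the first match swallows everything up to the repeat,
# so values that sit inside that span never get their own match attempt.
consuming = [m.group(1) for m in re.finditer(r'(foo|bar|123).*\1', line)]

# Lookahead version (an assumption, not from the answer above):
# only the value itself is consumed, so each value is tried on its own.
lookahead = re.findall(r'(foo|bar|123)(?=.*\1)', line)

print(consuming)
print(lookahead)
```

On this input the consuming version reports only foo, while the lookahead version reports foo and 123, so which one fits depends on whether you need every repeated value or just a yes/no answer.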
I have this string:
I have an eraser and 2 pencils.
Jane has a ruler and a stapler.
I need to get all the items that I have (lines starting with I have). I have tried these expressions:
(?:I have|and)\h+((?:a|an|\d+)\h+(?:\w+))
# returns some of the items that Jane has.
(I have )(?(1)((?:a|an|\d+) \w+))
# returns only the word closest to the beginning of the string.
I'm looking for a way to match a given string/expression at the beginning of the line or somewhere before the capturing group. Thanks in advance.
Note: I'm working with PCRE
It's still tricky to have a variable number of groups, but you can try this:
I have (?:an |a )?(\d? ?\w+)(?: and (?:an |a )?(\d? ?\w+))?(?: and (?:an |a )?(\d? ?\w+))?
Below are some sample results:
"I have an eraser and a pencil and an item" -> ["eraser", "pencil", "item"]
"She has a turtle and a car" -> []
"I have 3 bricks and 4 knees and a tie" -> ["3 bricks", "4 knees", "tie"]
"I have a motorcycle and a bag" -> ["motorcycle", "bag"]
"I have a journal" -> ["journal"]
"I have wires and tires" -> ["wires", "tires"]
"I must say I have a train and a bicycle" -> ["train", "bicycle"]
For each line, it will capture a maximum number of 3 items.
This is a typical case for anchoring at the end of previous match with \G.
We're trying to match some text followed by an unknown number of tokens, and it needs to capture each token individually. The regex engine is totally capable of repeating a construct to match repeating token, but each backreference must be defined on its own. Therefore, repeating a capturing group ends up overwriting its stored value and returning only the last matched value. This task may be achieved by 2 different strategies: either capturing all tokens with 1 pattern and then using a second pattern match to split them, or performing one full match for each token.
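The first of those two strategies (one pattern to capture the whole list, a second pattern to split it) can be sketched outside PCRE, for example in Python. This is only an illustration under my own assumptions: the function name items_i_have, the separators (a comma or the word "and") and the article handling (a/an) are not taken from the question.

```python
import re

text = """I have an eraser and 2 pencils.
Jane has a ruler and a stapler."""


def items_i_have(text):
    items = []
    # Step 1: one pattern grabs everything after "I have" on each matching line.
    for rest in re.findall(r'^I have (.*?)\.?$', text, flags=re.MULTILINE):
        # Step 2: a second pattern splits that capture into individual tokens.
        for token in re.split(r',\s*|\s+and\s+', rest):
            # drop a leading article (a / an), if there is one
            token = re.sub(r'^an?\s+', '', token.strip())
            if token:
                items.append(token)
    return items


print(items_i_have(text))
```

For the two lines in the question this yields eraser and 2 pencils, and nothing from the line about Jane.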
Instead of trying to get all the items "I have" in the same match, we're going to attempt to match once per item. This approach was also tried with some of the patterns proposed in the comments. However, as you may have realized, the regex engine also matches from the middle of the string, and thus matching unwanted cases like:
She has >>a turtle<< ...
This is where we can use an anchor like \G. Our strategy will be:
1. Match ^I have and capture 1 item (the match ends here).
2. In the next match, start at the end of the previous match, and match 1 item.
3. Repeat (2) for successive matches.
Now, this can be translated to regex:
^I have an? + the token
Literal text at the beginning of the line.
an or a.
And we'll cover the token construct later.
\G(?!^)(?: and)? an? + the token
\G matches a zero-width position at the end of previous match. This is how the regex engine won't attempt a match anywhere in the string.
However, \G also matches at the beginning of the string, and we don't want to match the string "an item...". There's a trick: we use the negative lookahead (?!^) to assert that the current position is not the start of the text. Therefore, it's guaranteed to match where it left off from the previous match in (1).
(?: and)? is optional, so it may or may not be there.
an? matches the article (an/a).
Do you see that both end up with the same construct? if we join the 2 options together:
(?:^I have:?|(?!^)\G(?: and)?) an? <<the token>>
Let's talk about the token. If it were only one word, we'd use \w+. That's not the case. Neither is .* because it shouldn't match the whole string. And we can't consume part of the following token because otherwise it wouldn't be returned in the next match.
I have a new eraser and a pencil
                   ^
                   |
How does it stop here?!
How do we define a condition not to allow a match beyond that position?
It's not followed by a/an/and !!!
This can be achieved by a negative lookahead, to guarantee it's not followed by a/an/and before we match a character: (?! a | an | and ).. As you can imagine, that construct will be repeated to match every one of the characters in a token.
This pattern matches what we want: (?:(?! and | an? ).)+
And one more thing, we'll use a capturing group around it to be able to extract the text.
the token = ((?:(?! and | an? ).)+)
First version:
We now have the first working version of the regex. Put together:
(?:^I have:?|(?!^)\G(?: and)?) an? ((?:(?! and | an? ).)+)
Test it in regex101
A few more tricks:
Following the same principle, this approach allows us to include more conditions to the match. For instance,
Not anchored to the start of line.
Without capturing groups, returning each token as the value of the full match.
Items can be separated with commas.
"I have" could be followed by any word, not necessarily an article, using exceptions.
etc.
What to choose depends on the subject text, and it should be tested with several examples and corrected until it works as desired.
Solution:
This is the pattern I'd personally use in this case:
(?:                                  # SUBPATTERN 1
    \bI have:?                       # "I have"
    (?![ ](?:to|been|\w+?[en]d)\b)   # not followed by to|been|\w+[en]d
  |                                  # or
    (?!\A)\G[ ]                      # anchored to previous match
    ?,?(?:[ ]?and)?                  # optional comma or "and"
)                                    #
                                     #
[ ](?:(?:an?|some)[ ])?              # ARTICLE: a|an|some
                                     #
\K                                   # \K (reset match)
                                     #
(?:                                  # SUBPATTERN 2
    (?!                              # Negative lookahead (exceptions)
        [ ]*,                        #  a. Comma to list another item
      |                              #  b. Article (a|an), some
        [ ](?:a(?:nd?)?|some)\b      #     or and
    )                                #
    .                                # MATCH each character in a token
)+                                   # REPEAT Subpattern 2
One-liner:
(?:\bI have:?(?! (?:to|been|\w+?[en]d)\b)|(?!\A)\G ?,?(?: ?and)?) (?:(?:an?|some) )?\K(?:(?! *,| (?:a(?:nd?)?|some)\b).)+
Test in regex101
However, it should be tested to identify exceptions and use cases. This is how it behaves with the examples discussed in this post.
Matching the subject text:
Each match has been marked.
I have an eraser, a pencil and an item
She has a turtle and a car
I have an awesome motorcycle tatoo and a bag
I have to say I have a train and a bicycle
I have 3 bricks and 4 knees and a tie
Notice these are full matches, and not the value returned by a group. Simply add a group to enclose the "subpattern 2" to capture the tokens.
Test in regex101
I'm using notepad++'s regular expression search function to find all strings in a .txt document that do not contain a specific value (HIJ in the below example), where all strings begin with the same value (ABC in the below example).
How would I go about doing this?
Example
Every String starts with ABC
ABC is never used in a string other than at the beginning,
ABCABC123 would be two strings --"ABC" and "ABC123"
HIJ may appear multiple times in a string
I need to find the strings that do not contain HIJ
Input is one long file with no line breaks, but does contain special characters (*, ^, #, ~, :) and spaces
Example Input:
ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J
Example Input would be viewed as the following strings
ABC1234HIJ56
ABC7#HIJ
ABC89
ABCHIJ0ABE:HIJ
ABC12~34HI456J
The Third and Fifth strings both lack "HIJ" and therefore are included in the output, all others are not included in the output.
Example desired output:
ABC89
ABC12~34HI456J
I am 99% new to RegEx and will be looking more into it in the future, as my job description suddenly changed earlier this week when someone else in the company left suddenly, and therefore I have been doing this manually by searching (ABC|HIJ) and going through the search function's results looking for "ABC" appearing twice in a row. Supposedly the former employee was able to do this in an automated way, but left no documentation.
Any help would be appreciated!
This question is a repeat of a prior question I asked, but I was very very bad at formatting a question and it seems to have sunk beyond noticeable levels.
You can find the items you want with:
ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+(?=ABC|$)
Note: in this first pattern, you can replace (?=ABC|$) with (?!HIJ)
pattern details:
ABC
(?:              # non-capturing group
    [^HA]+       # all that is not a H or an A
  |              # OR
    H(?!IJ)      # an H not followed by IJ
  |
    A(?!BC)      # an A not followed by BC
)*+              # repeat the group
(?=ABC|$)        # followed by "ABC" or the end of the string
Note: if you want to remove all that is not the items you want you can make this search replace:
search: (?:ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+HIJ.*?(?=ABC|$))+|(?=ABC)
replace: \r\n
You could use this pattern:
(ABC(?:(?!HIJ).)*?)(?=ABC|\R)
Demo
(              # Capturing Group (1)
  ABC          # "ABC"
  (?:          # Non Capturing Group
    (?!        # Negative Look-Ahead
      HIJ      # "HIJ"
    )          # End of Negative Look-Ahead
    .          # Any character except line break
  )            # End of Non Capturing Group
  *?           # (zero or more)(lazy)
)              # End of Capturing Group (1)
(?=            # Look-Ahead
  ABC          # "ABC"
  |            # OR
  \R           # <line break>
)              # End of Look-Ahead
You can use the following expression to match your criterion:
(^ABC(?:(?!HIJ).)*$)
This starts with ABC and looks ahead (negative) for HIJ pattern. The pattern works for the separated strings.
For a single line pattern (as provided in your question), a slight modification of this works (as follows):
(ABC(?:(?!HIJ).)*?)(?=ABC|$)
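To check the single-line pattern against the example input, here is a short Python run. Python's re module is my choice for the test (the question is about Notepad++), and I dropped the outer capturing group so that findall returns the whole matches.

```python
import re

data = "ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J"

# ABC, then lazily any characters as long as none of them starts "HIJ",
# stopping right before the next "ABC" or at the end of the input.
pattern = re.compile(r'ABC(?:(?!HIJ).)*?(?=ABC|$)')

matches = pattern.findall(data)
print(matches)
```

This returns ABC89 and ABC12~34HI456J, the two strings listed as the desired output.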
At the outset, let me explain that this question is neither about how to capture groups, nor about how to use quantifiers, two features of regex I am perfectly familiar with. It is more of an advanced question for regex lovers who may be familiar with unusual syntax in exotic engines.
Capturing Quantifiers
Does anyone know if a regex flavor allows you to capture quantifiers? By this, I mean that the number of characters matched by quantifiers such as + and * would be counted, and that this number could be used again in another quantifier.
For instance, suppose you wanted to make sure you have the same number of Ls and Rs in this kind of string: LLLRRRRR
You could imagine a syntax such as
L(+)R{\q1}
where the + quantifier for the L is captured, and where the captured number is referred to in the quantifier for the R as {\q1}
This would be useful to balance the number of {#,=,-,/} in strings such as
#### "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"
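To make the intent concrete, the same balance check can be done in host code by capturing each run and comparing the lengths afterwards. The Python sketch below is only an illustration of the goal, not the hypothetical L(+)R{\q1} syntax, and the function name same_run_lengths is made up.

```python
import re


def same_run_lengths(s):
    """Check that the runs of '#', '=', '-' and '/' all have the same length."""
    m = re.fullmatch(r'(#+)[^#=]+(=+)[^=\-]+(-+)[^\-/]+(/+)[^/]+', s)
    if m is None:
        return False
    # the "captured quantifier" is simply the length of each captured run
    return len({len(run) for run in m.groups()}) == 1


print(same_run_lengths('#### "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'))
print(same_run_lengths('### "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'))
```

The first call prints True and the second False, since the second string has only three # characters.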
Relation to Recursion
In some cases quantifier capture would elegantly replace recursion, for instance a piece of text framed by the same number of Ls and Rs, as in
L(+) some_content R{\q1}
The idea is presented in some details on the following page: Captured Quantifiers
It also discusses a natural extension to captured quantifiers: quantifier arithmetic, for occasions when you want to match (3*x + 1) the number of characters matched earlier.
I am trying to find out if anything like this exists.
Thanks in advance for your insights!!!
Update
Casimir gave a fantastic answer that shows two methods to validate that various parts of a pattern have the same length. However, I wouldn't want to rely on either of those for everyday work. These are really tricks that demonstrate great showmanship. In my mind, these beautiful but complex methods confirm the premise of the question: a regex feature to capture the number of characters that quantifiers (such as + or *) are able to match would make such balancing patterns very simple and extend the syntax in a pleasingly expressive way.
Update 2 (much later)
I found out that .NET has a feature that comes close to what I was asking about. Added an answer to demonstrate the feature.
I don't know a regex engine that can capture a quantifier. However, it is possible with PCRE or Perl to use some tricks to check if you have the same number of characters. With your example:
#### "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"
you can check if # = - / are balanced with this pattern that uses the famous Qtax trick, (are you ready?): the "possessive-optional self-referencing group"
~(?<!#)((?:#(?=[^=]*(\2?+=)[^-]*(\3?+-)[^/]*(\4?+/)))+)(?!#)(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))~
pattern details:
~ # pattern delimiter
(?<!#) # negative lookbehind used as an # boundary
( # first capturing group for the #
(?:
# # one #
(?= # checks that each # is followed by the same number
# of = - /
[^=]* # all that is not an =
(\2?+=) # The possessive optional self-referencing group:
# capture group 2: backreference to itself + one =
[^-]*(\3?+-) # the same for -
[^/]*(\4?+/) # the same for /
) # close the lookahead
)+ # close the non-capturing group and repeat
) # close the first capturing group
(?!#) # negative lookahead used as an # boundary too.
# this checks the boundaries for all groups
(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))
~
The main idea
The non-capturing group contains only one #. Each time this group is repeated a new character is added in capture groups 2, 3 and 4.
the possessive-optional self-referencing group
How does it work?
( (?: # (?= [^=]* (\2?+ = ) .....) )+ )
At the first occurrence of the # character, capture group 2 is not yet defined, so you cannot write something like (\2 =), which would make the pattern fail. To avoid the problem, make the backreference optional: \2?
The second aspect of this group is that the number of character = matched is incremented at each repetition of the non capturing group, since an = is added each time. To ensure that this number always increases (or the pattern fails), the possessive quantifier forces the backreference to be matched first before adding a new = character.
Note that this group can be seen like that: if group 2 exists then match it with the next =
( (?(2)\2) = )
The recursive way
~(?<!#)(?=(#(?>[^#=]+|(?-1))*=)(?!=))(?=(#(?>[^#-]+|(?-1))*-)(?!-))(?=(#(?>[^#/]+|(?-1))*/)(?!/))~
You need to use overlapped matches, since you will use the # part several times, it is the reason why all the pattern is inside lookarounds.
pattern details:
(?<!#) # left # boundary
(?= # open a lookahead (to allow overlapped matches)
( # open a capturing group
#
(?> # open an atomic group
[^#=]+ # all that is not an # or an =, one or more times
| # OR
(?-1) # recursion: the last defined capturing group (the current here)
)* # repeat zero or more the atomic group
= #
) # close the capture group
(?!=) # checks the = boundary
) # close the lookahead
(?=(#(?>[^#-]+|(?-1))*-)(?!-)) # the same for -
(?=(#(?>[^#/]+|(?-1))*/)(?!/)) # the same for /
The main difference from the previous pattern is that this one doesn't care about the order of the = - and / groups. (However you can easily make some changes to the first pattern to deal with that, with character classes and negative lookaheads.)
Note: For the example string, to be more specific, you can replace the negative lookbehind with an anchor (^ or \A). And if you want to obtain the whole string as match result you must add .* at the end (otherwise the match result will be empty as playful notices it.)
Coming back five weeks later because I learned that .NET has something that comes very close to the idea of "quantifier capture" mentioned in the question. The feature is called "balancing groups".
Here is the solution I came up with. It looks long, but it is quite simple.
(?:#(?<c1>)(?<c2>)(?<c3>))+[^#=]+(?<-c1>=)+[^=-]+(?<-c2>-)+[^-/]+(?<-c3>/)+[^/]+(?(c1)(?!))(?(c2)(?!))(?(c3)(?!))
How does it work?
1. The first non-capturing group matches the # characters. In that non-capturing group, we have three named groups c1, c2 and c3 that don't match anything, or rather, that match an empty string. These groups will serve as three counters c1, c2 and c3. Because .NET keeps track of intermediate captures when a group is quantified, every time an # is matched, a capture is added to the capture collections for Groups c1, c2 and c3.
2. Next, [^#=]+ eats up all the characters up to the first =.
3. The second quantified group (?<-c1>=)+ matches the = characters. That group seems to be named -c1, but -c1 is not a group name. -c1 is .NET syntax to pop one capture from the c1 group's capture collection into the ether. In other words, it allows us to decrement c1. If you try to decrement c1 when the capture collection is empty, the match fails. This ensures that we can never have more = than # characters. (Later, we'll have to make sure that we cannot have more # than = characters.)
4. The next steps repeat steps 2 and 3 for the - and / characters, decrementing counters c2 and c3.
5. The [^/]+ eats up the rest of the string.
6. The (?(c1)(?!)) is a conditional that says "If group c1 has been set, then fail". You may know that (?!) is a common trick to force a regex to fail. This conditional ensures that c1 has been decremented all the way to zero: in other words, there cannot be more # than = characters.
7. Likewise, the (?(c2)(?!)) and (?(c3)(?!)) ensure that there cannot be more # than - and / characters.
I don't know about you, but even though this is a bit long, I find it really intuitive.
How would I construct a regular expression to find all words that end in a string but don't begin with a string?
e.g. Find all words that end in 'friend' that don't start with the word 'girl' in the following sentence:
"A boyfriend and girlfriend gained a friend when they asked to befriend them"
The words boyfriend, friend and befriend should match. The word 'girlfriend' should not.
Off the top of my head, you could try:
\b # word boundary - matches start of word
(?!girl) # negative lookahead for literal 'girl'
\w* # zero or more letters, numbers, or underscores
friend # literal 'friend'
\b # word boundary - matches end of word
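Written on one line, the pattern can be tried in Python (my choice of engine for the test; the question does not name one):

```python
import re

sentence = "A boyfriend and girlfriend gained a friend when they asked to befriend them"

# word boundary, not starting with "girl", any word chars, then "friend" at the word's end
pattern = re.compile(r'\b(?!girl)\w*friend\b')

found = pattern.findall(sentence)
print(found)
```

This finds boyfriend, friend and befriend, and skips girlfriend.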
Update
Here's another non-obvious approach which should work in any modern implementation of regular expressions:
Assuming you wish to extract a pattern which appears within multiple contexts but you only want to match if it appears in a specific context, you can use an alternation where you first specify what you don't want and then capture what you do.
So, using your example, to extract all of the words that either are or end in friend except girlfriend, you'd use:
\b # word boundary
(?: # start of non-capture group
girlfriend # literal (note 1)
| # alternation
( # start of capture group #1 (note 2)
\w* # zero or more word chars [a-zA-Z0-9_]
friend # literal
) # end of capture group #1
) # end of non-capture group
\b
Notes:
1. This is what we do not wish to capture.
2. And this is what we do wish to capture.
Which can be described as:
for all words
first, match 'girlfriend' and do not capture (discard)
then match any word that is or ends in 'friend' and capture it
In Javascript:
const target = 'A boyfriend and girlfriend gained a friend when they asked to befriend them';
const pattern = /\b(?:girlfriend|(\w*friend))\b/g;
let result = [];
let arr;
while ((arr = pattern.exec(target)) !== null) {
  if (arr[1]) {
    result.push(arr[1]);
  }
}
console.log(result);
which, when run, will print:
[ 'boyfriend', 'friend', 'befriend' ]
This may work:
\w*(?<!girl)friend
you could also try
\w*(?<!girl)friend\w* if you wanted to match words like befriended or boyfriends.
I'm not sure if ?<! is available in all regex versions, but this expression worked in Expresso (which I believe is .NET).
Try this:
/\b(?!girl)\w*friend\b/ig
I changed Rob Raisch's answer to a regexp that finds words containing a specific substring, but not also containing a different specific substring:
\b(?![\w_]*Unwanted[\w_]*)[\w_]*Desired[\w_]*\b
So for example \b(?![\w_]*mon[\w_]*)[\w_]*day[\w_]*\b will find every word with "day" (eg day , tuesday , daywalker ) in it, except if it also contains "mon" (eg monday)
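The same idea can be wrapped in a small Python helper. The name words_with_but_without is hypothetical, and I wrote \w instead of [\w_] because \w already includes the underscore; otherwise the pattern is the one above.

```python
import re


def words_with_but_without(text, desired, unwanted):
    """Find words containing `desired` that do not also contain `unwanted`."""
    d, u = re.escape(desired), re.escape(unwanted)
    pattern = rf'\b(?!\w*{u}\w*)\w*{d}\w*\b'
    return re.findall(pattern, text)


print(words_with_but_without("day tuesday daywalker monday", "day", "mon"))
```

With this input the helper returns day, tuesday and daywalker, and leaves out monday. The match is case-sensitive unless a flag such as re.IGNORECASE is added.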
Maybe useful for someone.
In my case I needed to exclude some words that have a given prefix from regex matching result
the text was query-string params
?=&sysNew=false&sysStart=true&sysOffset=4&Question=1
The prefix is sys and I don't want the words that start with sys.
the key to solve the issue was with word boundary \b
\b(?!sys)\w+\b
then I added that part in the bigger regex for query-string
(\b(?!sys)\w+\b)=(\w+)
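Here is how that last pattern behaves on the sample query string, using Python's re module (my choice for the test):

```python
import re

query = "?=&sysNew=false&sysStart=true&sysOffset=4&Question=1"

# group 1: a key that does not start with "sys"; group 2: its value
pairs = re.findall(r'(\b(?!sys)\w+\b)=(\w+)', query)
print(pairs)
```

Only the Question=1 pair comes back; the three sys* parameters are skipped.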