I have three values that I'm trying to match: foo, bar and 123. However, I would like to match them only if they occur twice.
In the following line:
foo;bar;123;foo;123;
since bar is not present twice, only foo and 123 would match:
foo;bar;123;foo;123;
I understand how to require exactly two matches with (foo|bar|123){2}; however, I need to use backreferences to make it work in my example.
I'm struggling to put the two concepts together into a working solution.
You could use
(?<=^|;)([^\n;]+)(?=.*(?:(?<=^|;)\1(?=;|$)))
Broken down, this is
(?<=^|;) # pos. lookbehind, either start of string or ;
([^\n;]+) # not ; nor newline 1+ times
(?=.* # pos. lookahead
(?:
(?<=^|;) # same pattern as above
\1 # group 1
(?=;|$) # end or ;
)
)
\b # word boundary
([^;]+) # anything not ; 1+ times
\b # another word boundary
(?=.*\1) # pos. lookahead, making sure the pattern is found again
See a demo on regex101.com.
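Python's built-in re module only accepts fixed-width lookbehinds, so (?<=^|;) does not compile there as written. Here is a minimal sketch of an adaptation, assuming Python: each "start or ;" and "; or end" check is swapped for a fixed-width negative lookaround. The name PATTERN is illustrative and not part of the answer above.

```python
import re

# (?<![^;]) : not preceded by a non-';' character, i.e. start of string or ';'
# (?![^;])  : not followed by a non-';' character, i.e. ';' or end of string
PATTERN = re.compile(r"(?<![^;])([^;\n]+)(?![^;])(?=.*(?<![^;])\1(?![^;]))")

line = "foo;bar;123;foo;123;"
# findall returns group 1 for every value that occurs again later in the line
repeated = PATTERN.findall(line)
print(repeated)  # ['foo', '123']
```

Only the first occurrence of each repeated value is reported, because the lookahead searches forward from the current position.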
Otherwise - as said in the comments - split on the ; programmatically and use some programming logic afterwards.
Find a demo in Python for example (can be adjusted for other languages as well):
from collections import Counter
string = """
foo;bar;123;foo;123;
foo;bar;foo;bar;
foo;foo;foo;bar;bar;
"""
twins = [element
for line in string.split("\n")
for element, times in Counter(line.split(";")).most_common()
if times == 2]
print(twins)
Allowing for text between the two occurrences with ".*", this should match any of your values that occurs at least twice:
(foo|bar|123).*\1
I am trying to understand how regex works. I understand it little by little. However, I don't understand this one completely. It's basically a regex for fully qualified domain names but a requirement is that the ending can't be .arpa.
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
https://regex101.com/r/hU6tP0/3
This doesn't match google.uk. If I change it to:
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{1,63}[^.arpa]$)
It works again.
But this works as well
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
Here is my thought process for
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
I see it as this
(?=
Is a positive look ahead (Can someone explain to me what this actually means?) As I understand it now, it just means that the string needs to match the regex.
^.{4,253}$)
Match all characters but it needs to be between 4 and 253 characters long.
(^([a-zA-Z0-9]{1,63}\.)
Start a capture group and make another capture group within. This capture group says that every non special character can be written 1 to 63 times or till the . is written.
+
The previous capture group can be repeated indefinitely, but it should always end with a .. This way the next capture group is started.
[a-zA-Z]{2,63}
Then as many times as you want you can write a to z with upper, but it needs to be between 2 and 63.
[^.arpa]$)
The last characters can't be .arpa.
Can someone tell me where I am going wrong?
This doesn't do what you think it does:
[^.arpa]
All that says is 'ends with a single character that isn't one of ., a, r or p' - it's a negated character class.
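A quick way to see the effect is to run the question's pattern against a few names. This is a small sketch assuming Python's re module; example.data is a made-up name, added only to show that any name ending in a, r, p or a dot is rejected.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)")

names = ["google.uk", "google.com", "domain.arpa", "example.data"]
results = {name: bool(pattern.search(name)) for name in names}
for name, ok in results.items():
    print(name, ok)
# google.uk False    (the class needs a third character after the two letters)
# google.com True
# domain.arpa False  (only because the last character is 'a')
# example.data False (rejected for the same reason)
```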
You might be thinking of a negative lookahead assertion:
(?!\.arpa)$
But if you're trying to compound multiple criteria in a regex, I'd suggest you're probably using the wrong tool for the job. It ends up complicated and hard to debug, thanks to greedy/non-greedy matching, etc.
Positive and negative lookaheads check what does (or does not) follow a position without consuming it. But that can have some unexpected outcomes if you're matching variable widths, because the regex engine will backtrack until it finds something that matches.
A simpler example:
([\w.]+)(?!arpa)$
Applied to:
www.test.arpa
Will it match? What's in the group?
... it will match, because [\w\.]+ will consume all of it, and then the lookahead won't "see" anything.
If you use:
([\w]+)\.(?!arpa)
Instead, though, you'll capture www, but you won't match test (with e.g. the g flag), because www doesn't have arpa after its dot, but test does.
https://regex101.com/r/hU6tP0/5
It really does get complicated using negative assertions in a pattern as a result. I'd suggest simply not doing so, and applying two separate tests. It's hard for you to figure out, and it's hard for a future maintenance programmer too!
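The "two separate tests" suggestion could look like the following. This is only a sketch, assuming Python; the names SHAPE and is_valid_domain are hypothetical, and the shape pattern is the question's regex with the .arpa part removed.

```python
import re

# Test 1: overall shape (4-253 chars, dot-separated labels, alphabetic TLD).
SHAPE = re.compile(r"^(?=.{4,253}$)(?:[a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$")

def is_valid_domain(name):
    """Return True if name has the expected shape and does not end in .arpa."""
    if not SHAPE.match(name):
        return False
    # Test 2: a plain string check instead of a negative lookahead.
    return not name.lower().endswith(".arpa")

for name in ["google.uk", "domain.arpa", "domain.arpanet", "another.domain.arpa.net"]:
    print(name, is_valid_domain(name))
```

Keeping the .arpa check outside the regex means each rule can be read and changed independently.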
This is an analysis of your regex:
(?=^.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Capturing Group 1 (CG1) - not needed
^ # Match start of the string
( # CG2 (can be a non capturing group '(?:...)')
[a-zA-Z0-9]{1,63} # any sequence of letters and numbers with length between 1 and 63
\. # a literal dot
)+ # CLOSE CG2
[a-zA-Z]{1,63} # any letter sequence with length between 1 to 63
[^.arpa] # a negated char class: any char that is not a "literal" '.','a','r','p' (last 'a' is redundant)
$ # end of the string
) # CLOSE CG1
To prevent the string from ending in .arpa, you need a negative lookahead (?!...), so modify it like this:
(?=^.{4,253}$)(?!.*\.arpa$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
An online demo
Update:
I've reworked the regex to simplify it (I've also incorporated Sobrique's suggestion, adding an important detail):
/^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i
Compact version online demo
Legend
/ # js regex delimiter
^ # start of the string
(?=.{4,253}$) # force min length: 4 chars, max length: 253 chars
(?: # Non capturing group 1 (NCG1)
[a-z0-9]{1,63} # any letter or digit in a sequence with length from 1 to 63 chars
[.] # a literal dot '.' (more readable than \.)
)+ # CLOSE NCG1 - repeat its content one or more time
(?!arpa$) # force that after the last literal dot '.' the string does not end with 'arpa' (I've added '$' to Sobrique's suggestion; without it, '.arpanet' would be rejected too)
[a-z]{2,63} # a sequence of letters with length from 2 to 63
$ # end of the string
/i # Close the regex delimiter and add the case-insensitive flag: [a-z] also matches [A-Z] and vice versa
var re = /^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i;
var tests = ['google.uk','domain.arpa','domain.arpa2','another.domain.arpa.net','domain.arpanet'];
var t;
while(t = tests.pop()) {
document.getElementById("r").innerHTML += '"' + t + '"<br/>';
document.getElementById("r").innerHTML += 'Valid domain? ' + ( (t.match(re)) ? '<font color="green">YES</font>' : '<font color="red">NO</font>') + '<br/><br/>';
}
<div id="r"/>
I'm trying to create a regex to check the number of unique users.
In this case, 3 different users in 1 string means it's valid.
Let's say we have the following string
lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven
It contains the domain for each user (lab) and their first name.
Each user is separated by ;
The goal is to have 3 unique users in a string.
In this case, the string is valid because we have the following unique users
simon, lieven, tim, davy = valid
If we take this string
lab\simon;lab\lieven;lab\simon
It's invalid because we only have 2 unique users
simon, lieven = invalid
So far, I've only come up with the following regex but I don't know how to continue
/(lab)\\(?:[a-zA-Z]*)/g
Could you help me with this regex?
Please let me know if you need more information if it's not clear.
What you are after cannot be achieved through regular expressions on their own. Regular expressions are meant for parsing information, not for processing it.
There is no particular pattern you are after, and patterns are what regular expressions excel at. You will need to split by ; and use a data structure such as a set to store your string values.
Is this what you want?
1) Using a regular expression:
import re
s = r'lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven'
# capture the name after each "lab\" ([A-Za-z] rather than [A-z], which also matches [ \ ] ^ _ `)
pattern = re.compile(r'lab\\([A-Za-z]+)')
users = pattern.findall(s)
# valid when at least 3 distinct users are present
if len(set(users)) >= 3:
    print('Valid')
else:
    print('Invalid')
2) Without using a regular expression:
s = r'lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven'
# keep the part after the last backslash of each ';'-separated entry
users = [i.split('\\')[-1] for i in s.split(';')]
# valid when at least 3 distinct users are present
if len(set(users)) >= 3:
    print('Valid')
else:
    print('Invalid')
For a successful match, we need at least 3 consecutive entries of the form lab\user (note that this counts entries; it does not check that the users are distinct), i.e.:
(?:\\?lab\\[\w]+(?:;|$)){3}
You didn't specify your engine, but with Python you can use:
import re
string = r'lab\simon;lab\lieven;lab\tim'  # example input
if re.search(r"(?:\\?lab\\[\w]+(?:;|$)){3}", string):
    print("Successful match")
else:
    print("Match attempt failed")
Regex Demo
Regex Explanation
(?:\\?lab\\[\w]+(?:;|$)){3}
Match the regular expression «(?:\\?lab\\[\w]+(?:;|$)){3}»
Exactly 3 times «{3}»
Match the backslash character «\\?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character string “lab” literally «lab»
Match the backslash character «\\»
Match a single character that is a “word character” «[\w]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regular expression below «(?:;|$)»
Match this alternative «;»
Match the character “;” literally «;»
Or match this alternative «$»
Assert position at the end of a line «$»
Here is a beginner-friendly way to solve your problem:
Split the string on ";" with .split() and store the result in an array variable, such as splitted_string.
Declare a second, empty array to hold each unique name, such as unique_names.
Use a for loop to iterate through the splitted_string array. For each entry, take the name after "lab\"; if it isn't already in unique_names, add it.
Check whether the length of unique_names is at least 3. If it is, print a success message. If not, print a fail message.
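The steps above can be sketched in Python roughly as follows. The function name has_three_unique_users is hypothetical, and the sketch assumes that entries are separated by ";" and that "at least 3 unique names" is the rule, as in the question's examples.

```python
def has_three_unique_users(s):
    # Step 1: split the string into its ';'-separated entries.
    splitted_string = s.split(";")
    # Step 2: an empty list for the unique names.
    unique_names = []
    # Step 3: keep the part after the last backslash; add it only if it is new.
    for entry in splitted_string:
        name = entry.split("\\")[-1]
        if name not in unique_names:
            unique_names.append(name)
    # Step 4: at least three distinct names means the string is valid.
    return len(unique_names) >= 3

print(has_three_unique_users(r"lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven"))  # True
print(has_three_unique_users(r"lab\simon;lab\lieven;lab\simon"))  # False
```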
You seem like a practical person that is relatively new to string manipulation. Maybe you would enjoy some practical background reading on string manipulation at beginner sites like Automate The Boring Stuff With Python:
https://automatetheboringstuff.com/chapter6/
Or Codecademy, etc.
Another pure regex answer, for the sport. As others said, you should probably not be doing this:
^([^;]+)(;\1)*;((?!\1)[^;]+)(;(\1|\3))*;((?!\1|\3)[^;]+)
Explanation :
^ from the start of the string
([^;]+) we catch everything that isn't a ';'.
that's our first user, and our first capturing group
(;\1)* it could be repeated
;((?!\1)[^;]+) but at some point, we want to capture everything that isn't either
our first user nor a ';'. That's our second user,
and our third capturing group
(;(\1|\3))* both the first and second user can be repeated now
;((?!\1|\3)[^;]+) but at some point, we want to capture yada yada,
our third user and sixth capturing group
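To try the pattern above, here is a short sketch assuming Python's re module, which supports the backreferences and lookaheads it uses. The variable names are illustrative.

```python
import re

pattern = re.compile(r"^([^;]+)(;\1)*;((?!\1)[^;]+)(;(\1|\3))*;((?!\1|\3)[^;]+)")

valid = r"lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven"
invalid = r"lab\simon;lab\lieven;lab\simon"

m = pattern.search(valid)
# Groups 1, 3 and 6 hold the three distinct entries.
print(m.group(1), m.group(3), m.group(6))
# Only two distinct entries, so there is no match.
print(pattern.search(invalid))  # None
```

One caveat: (?!\1) only rejects an entry that starts with the first one, so an entry such as lab\simone would be treated as a repeat of lab\simon.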
This can be done with a simple regex.
Uses a conditional for each user name slot so that the required
three names are obtained.
Note that since the three slots are in a loop, the conditional guarantees the
capture group is not overwritten (which would invalidate the below-mentioned
assertion test (?! \1 | \2 | \3 )).
There is a complication. Each user name uses the same regex [a-zA-Z]+
so to accommodate that, a function is defined to check that the slot
has not been matched before.
This uses the Boost engine, which cosmetically requires that a group be
defined before it is backreferenced.
The workaround is to define a function at the bottom, after the groups are defined.
In Perl (and some other engines) a group does not have to be defined
before it is backreferenced, so you could do away with the function
and put
(?! \1 | \2 | \3 ) # Cannot have seen this user
[a-zA-Z]+
in the capture groups on top.
At a minimum, this requires conditionals.
Formatted and tested:
# (?:(?:.*?\blab\\(?:((?(1)(?!))(?&GetUser))|((?(2)(?!))(?&GetUser))|((?(3)(?!))(?&GetUser))))){3}(?(DEFINE)(?<GetUser>(?!\1|\2|\3)[a-zA-Z]+))
# Look for 3 unique users
(?:
(?:
.*?
\b lab \\
(?:
( # (1), User 1
(?(1) (?!) )
(?&GetUser)
)
| ( # (2), User 2
(?(2) (?!) )
(?&GetUser)
)
| ( # (3), User 3
(?(3) (?!) )
(?&GetUser)
)
)
)
){3}
(?(DEFINE)
(?<GetUser> # (4)
(?! \1 | \2 | \3 ) # Cannot have seen this user
[a-zA-Z]+
)
)
In a text editor, I want to replace a given word with the number of the line on which this word is found. Is this possible with regex?
Recursion, Self-Referencing Group (Qtax trick), Reverse Qtax or Balancing Groups
Introduction
The idea of adding a list of integers to the bottom of the input is similar to a famous database hack (nothing to do with regex) where one joins to a table of integers. My original answer used the Qtax trick. The current answers use either Recursion, the Qtax trick (straight or in a reversed variation), or Balancing Groups.
Yes, it is possible... With some caveats and regex trickery.
The solutions in this answer are meant as a vehicle to demonstrate some regex syntax more than practical answers to be implemented.
At the end of your file, we will paste a list of numbers preceded with a unique delimiter. For this experiment, the appended string is :1:2:3:4:5:6:7 This is a similar technique to a famous database hack that uses a table of integers.
For the first two solutions, we need an editor that uses a regex flavor that allows recursion (solution 1) or self-referencing capture groups (solution 2 and its reversed optimization). Two come to mind: Notepad++ and EditPad Pro. For the third solution, we need an editor that supports balancing groups. That probably limits us to EditPad Pro or Visual Studio 2013+.
Input file:
Let's say we are searching for pig and want to replace it with the line number.
We'll use this as input:
my cat
dog
my pig
my cow
my mouse
:1:2:3:4:5:6:7
First Solution: Recursion
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested).
The recursive structure lives in a lookahead, and is optional. Its job is to balance lines that don't contain pig, on the left, with numbers, on the right: think of it as balancing a nested construct like {{{ }}}... Except that on the left we have the no-match lines, and on the right we have the numbers. The point is that when we exit the lookahead, we know how many lines were skipped.
Search:
(?sm)(?=.*?pig)(?=((?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?:(?1)|[^:]+)(:\d+))?).*?\Kpig(?=.*?(?(2)\2):(\d+))
Free-Spacing Version with Comments:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # fail right away if pig isn't there
(?= # The Recursive Structure Lives In This Lookahead
( # Group 1
(?: # skip one line
^
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
(?:(?1)|[^:]+) # recurse Group 1 OR match all chars that are not a :
(:\d+) # match digits
)? # End Group
) # End lookahead.
.*?\Kpig # get to pig
(?=.*?(?(2)\2):(\d+)) # Lookahead: capture the next digits
Replace: \3
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Second Solution: Group that Refers to Itself ("Qtax Trick")
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested). The solution is easy to adapt to .NET by converting the \K to a lookahead and the possessive quantifier to an atomic group (see the .NET Version a few lines below.)
Search:
(?sm)(?=.*?pig)(?:(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*+.*?\Kpig(?=[^:]+(?(1)\1):(\d+))
.NET version: Back to the Future
.NET does not have \K. In its place, we use a "back to the future" lookbehind (a lookbehind that contains a lookahead that skips ahead of the match). Also, we need to use an atomic group instead of a possessive quantifier.
(?sm)(?<=(?=.*?pig)(?=(?>(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*).*)pig(?=[^:]+(?(1)\1):(\d+))
Free-Spacing Version with Comments (Perl / PCRE Version):
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# for each line skipped, let Group 1 match an ever increasing portion of the numbers string at the bottom
(?= # lookahead
[^:]+ # skip all chars that are not colons
( # start Group 1
(?(1)\1) # match Group 1 if set
:\d+ # match a colon and some digits
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop everything we've matched so far
pig # match pig (this is the match!)
(?=[^:]+(?(1)\1):(\d+)) # capture the next number to Group 2
Replace:
\2
Output:
my cat
dog
my 3
my cow
my mouse
:1:2:3:4:5:6:7
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Choice of Delimiter for Digits
In our example, the delimiter : for the string of digits is rather common, and could happen elsewhere. We can invent a UNIQUE_DELIMITER and tweak the expression slightly. But the following optimization is even more efficient and lets us keep the :
Optimization on Second Solution: Reverse String of Digits
Instead of pasting our digits in order, it may be to our benefit to use them in the reverse order: :7:6:5:4:3:2:1
In our lookaheads, this allows us to get down to the bottom of the input with a simple .*, and to start backtracking from there. Since we know we're at the end of the string, we don't have to worry about the :digits being part of another section of the string. Here's how to do it.
Input:
my cat pi g
dog p ig
my pig
my cow
my mouse
:7:6:5:4:3:2:1
Search:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line that doesn't have pig
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# Group 1 matches increasing portion of the numbers string at the bottom
(?= # lookahead
.* # get to the end of the input
( # start Group 1
:\d+ # match a colon and some digits
(?(1)\1) # match Group 1 if set
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop match so far
pig # match pig (this is the match!)
(?=.*(\d+)(?(1)\1)) # capture the next number to Group 2
Replace: \2
See the substitutions in the demo.
Third Solution: Balancing Groups
This solution is specific to .NET.
Search:
(?m)(?<=\A(?<c>^(?:(?!pig)[^\r\n])*(?:\r?\n))*.*?)pig(?=[^:]+(?(c)(?<-c>:\d+)*):(\d+))
Free-Spacing Version with Comments:
(?xm) # free-spacing, multi-line
(?<= # lookbehind
\A #
(?<c> # skip one line that doesn't have pig
# The length of Group c Captures will serve as a counter
^ # beginning of line
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
) # end skipper
* # repeat skipper
.*? # we're on the pig line: lazily match chars before pig
) # end lookbehind
pig # match pig: this is the match
(?= # lookahead
[^:]+ # get to the digits
(?(c) # if Group c has been set
(?<-c>:\d+) # decrement c while we match a group of digits
* # repeat: this will only repeat as long as the length of Group c captures > 0
) # end if Group c has been set
:(\d+) # Match the next digit group, capture the digits
) # end lookahead
Replace: $1
Reference
Qtax trick
On Which Line Number Was the Regex Match Found?
Because you didn't specify which text editor, in vim it would be:
:%s/searched_word/\=printf('%-4d', line('.'))/g (read more)
But as somebody mentioned it's not a question for SO but rather Super User ;)
I don't know of an editor that does that short of extending an editor that allows arbitrary extensions.
You could easily use perl to do the task, though.
perl -i.bak -pe"s/word/$./eg" file
Or if you want to use wildcards,
perl -MFile::DosGlob=glob -i.bak -pe"BEGIN { @ARGV = map glob($_), @ARGV } s/word/$./eg" *.txt
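The same task can be sketched in Python for comparison. The helper name replace_with_line_number is hypothetical, and matching whole words only (via \b) is an assumption that goes beyond the Perl one-liner, which replaces the text wherever it occurs.

```python
import re

def replace_with_line_number(text, word):
    """Replace each whole-word occurrence of word with its 1-based line number."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    lines = text.split("\n")
    return "\n".join(
        pattern.sub(str(number), line)
        for number, line in enumerate(lines, start=1)
    )

sample = "my cat\ndog\nmy pig\nmy cow\nmy mouse"
print(replace_with_line_number(sample, "pig"))
# The third line becomes "my 3"; the other lines are unchanged.
```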
At the outset, let me explain that this question is neither about how to capture groups, nor about how to use quantifiers, two features of regex I am perfectly familiar with. It is more of an advanced question for regex lovers who may be familiar with unusual syntax in exotic engines.
Capturing Quantifiers
Does anyone know if a regex flavor allows you to capture quantifiers? By this, I mean that the number of characters matched by quantifiers such as + and * would be counted, and that this number could be used again in another quantifier.
For instance, suppose you wanted to make sure you have the same number of Ls and Rs in this kind of string: LLLRRRRR
You could imagine a syntax such as
L(+)R{\q1}
where the + quantifier for the L is captured, and where the captured number is referred to in the quantifier for the R as {\q1}
This would be useful to balance the number of {#,=,-,/} in strings such as
#### "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"
Relation to Recursion
In some cases quantifier capture would elegantly replace recursion, for instance for a piece of text framed by the same number of Ls and Rs, as in
L(+) some_content R{\q1}
The idea is presented in some details on the following page: Captured Quantifiers
It also discusses a natural extension to captured quantifiers: quantifier arithmetic, for occasions when you want to match (3*x + 1) times the number of characters matched earlier.
I am trying to find out if anything like this exists.
Thanks in advance for your insights!!!
Update
Casimir gave a fantastic answer that shows two methods to validate that various parts of a pattern have the same length. However, I wouldn't want to rely on either of those for everyday work. These are really tricks that demonstrate great showmanship. In my mind, these beautiful but complex methods confirm the premise of the question: a regex feature to capture the number of characters that quantifiers (such as + or *) are able to match would make such balancing patterns very simple and extend the syntax in a pleasingly expressive way.
Update 2 (much later)
I found out that .NET has a feature that comes close to what I was asking about. Added an answer to demonstrate the feature.
I don't know of a regex engine that can capture a quantifier. However, with PCRE or Perl it is possible to use some tricks to check that you have the same number of characters. With your example:
#### "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"
you can check whether # = - / are balanced with this pattern, which uses the famous Qtax trick (are you ready?): the "possessive-optional self-referencing group"
~(?<!#)((?:#(?=[^=]*(\2?+=)[^-]*(\3?+-)[^/]*(\4?+/)))+)(?!#)(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))~
pattern details:
~ # pattern delimiter
(?<!#) # negative lookbehind used as an # boundary
( # first capturing group for the #
(?:
# # one #
(?= # checks that each # is followed by the same number
# of = - /
[^=]* # all that is not an =
(\2?+=) # The possessive optional self-referencing group:
# capture group 2: backreference to itself + one =
[^-]*(\3?+-) # the same for -
[^/]*(\4?+/) # the same for /
) # close the lookahead
)+ # close the non-capturing group and repeat
) # close the first capturing group
(?!#) # negative lookahead used as an # boundary too.
# this checks the boundaries for all groups
(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))
~
The main idea
The non-capturing group contains only one #. Each time this group is repeated a new character is added in capture groups 2, 3 and 4.
the possessive-optional self-referencing group
How does it work?
( (?: # (?= [^=]* (\2?+ = ) .....) )+ )
At the first occurrence of the # character, capture group 2 is not yet defined, so you cannot write something like (\2 =): that would make the pattern fail. To avoid the problem, make the backreference optional: \2?
The second aspect of this group is that the number of character = matched is incremented at each repetition of the non capturing group, since an = is added each time. To ensure that this number always increases (or the pattern fails), the possessive quantifier forces the backreference to be matched first before adding a new = character.
Note that this group can be read as: if group 2 exists, match it, followed by the next =
( (?(2)\2) = )
The recursive way
~(?<!#)(?=(#(?>[^#=]+|(?-1))*=)(?!=))(?=(#(?>[^#-]+|(?-1))*-)(?!-))(?=(#(?>[^#/]+|(?-1))*/)(?!/))~
You need overlapping matches, since the # part is used several times; that is why the whole pattern sits inside lookarounds.
pattern details:
(?<!#) # left # boundary
(?= # open a lookahead (to allow overlapped matches)
( # open a capturing group
#
(?> # open an atomic group
[^#=]+ # all that is not an # or an =, one or more times
| # OR
(?-1) # recursion: the last defined capturing group (the current here)
)* # repeat zero or more the atomic group
= #
) # close the capture group
(?!=) # checks the = boundary
) # close the lookahead
(?=(#(?>[^#-]+|(?-1))*-)(?!-)) # the same for -
(?=(#(?>[^#/]+|(?-1))*/)(?!/)) # the same for /
The main difference from the previous pattern is that this one doesn't care about the order of the = - and / groups. (However, you can easily change the first pattern to deal with that, using character classes and negative lookaheads.)
Note: For the example string, to be more specific, you can replace the negative lookbehind with an anchor (^ or \A). And if you want to obtain the whole string as the match result, you must add .* at the end (otherwise the match result will be empty, as playful points out).
Coming back five weeks later because I learned that .NET has something that comes very close to the idea of "quantifier capture" mentioned in the question. The feature is called "balancing groups".
Here is the solution I came up with. It looks long, but it is quite simple.
(?:#(?<c1>)(?<c2>)(?<c3>))+[^#=]+(?<-c1>=)+[^=-]+(?<-c2>-)+[^-/]+(?<-c3>/)+[^/]+(?(c1)(?!))(?(c2)(?!))(?(c3)(?!))
How does it work?
The first non-capturing group matches the # characters. In that non-capturing group, we have three named groups c1, c2 and c3 that don't match anything, or rather, that match an empty string. These groups will serve as three counters c1, c2 and c3. Because .NET keeps track of intermediate captures when a group is quantified, every time an # is matched, a capture is added to the capture collections for Groups c1, c2 and c3.
Next, [^#=]+ eats up all the characters up to the first =.
The second quantified group (?<-c1>=)+ matches the = characters. That group seems to be named -c1, but -c1 is not a group name. -c1 is .NET syntax to pop one capture from the c1 group's capture collection into the ether. In other words, it allows us to decrement c1. If you try to decrement c1 when the capture collection is empty, the match fails. This ensures that we can never have more = than # characters. (Later, we'll have to make sure that we cannot have more # than = characters.)
The next steps repeat steps 2 and 3 for the - and / characters, decrementing counters c2 and c3.
The [^/]+ eats up the rest of the string.
The (?(c1)(?!)) is a conditional that says "If group c1 has been set, then fail". You may know that (?!) is a common trick to force a regex to fail. This conditional ensures that c1 has been decremented all the way to zero: in other words, there cannot be more # than = characters.
Likewise, the (?(c2)(?!)) and (?(c3)(?!)) ensure that there cannot be more # than - and / characters.
I don't know about you, but even though this is a bit long, I find it really intuitive.
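For engines without balancing groups, the counting can be moved out of the regex and into code. The following is a sketch assuming Python, not an equivalent of the .NET pattern: the function name runs_balanced is hypothetical, and it checks only that each delimiter occurs as exactly one run and that all runs have the same length, not the order in which they appear.

```python
import re

def runs_balanced(s, chars="#=-/"):
    """Return True if each character in chars occurs as one run, all of equal length."""
    lengths = []
    for ch in chars:
        # Find every maximal run of this character.
        runs = re.findall(re.escape(ch) + "+", s)
        if len(runs) != 1:
            return False
        lengths.append(len(runs[0]))
    # Balanced when every run has the same length.
    return len(set(lengths)) == 1

print(runs_balanced('#### "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'))  # True
print(runs_balanced('#### "Star Wars" === "1977" ---- "Science Fiction" //// "George Lucas"'))   # False
```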
How would I construct a regular expression to find all words that end in a string but don't begin with a string?
e.g. Find all words that end in 'friend' that don't start with the word 'girl' in the following sentence:
"A boyfriend and girlfriend gained a friend when they asked to befriend them"
The words boyfriend, friend and befriend should match. The word 'girlfriend' should not.
Off the top of my head, you could try:
\b # word boundary - matches start of word
(?!girl) # negative lookahead for literal 'girl'
\w* # zero or more letters, numbers, or underscores
friend # literal 'friend'
\b # word boundary - matches end of word
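As a quick check of the pattern above, here is a short sketch assuming Python's re module, with the commented pieces joined into one pattern:

```python
import re

sentence = "A boyfriend and girlfriend gained a friend when they asked to befriend them"
# Word boundary, not starting with 'girl', any word chars, then 'friend' at the word end.
matches = re.findall(r"\b(?!girl)\w*friend\b", sentence)
print(matches)  # ['boyfriend', 'friend', 'befriend']
```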
Update
Here's another non-obvious approach which should work in any modern implementation of regular expressions:
Assuming you wish to extract a pattern which appears within multiple contexts but you only want to match if it appears in a specific context, you can use an alternation where you first specify what you don't want and then capture what you do.
So, using your example, to extract all of the words that either are or end in friend except girlfriend, you'd use:
\b # word boundary
(?: # start of non-capture group
girlfriend # literal (note 1)
| # alternation
( # start of capture group #1 (note 2)
\w* # zero or more word chars [a-zA-Z0-9_]
friend # literal
) # end of capture group #1
) # end of non-capture group
\b
Notes:
1. This is what we do not wish to capture.
2. And this is what we do wish to capture.
Which can be described as:
for all words
first, match 'girlfriend' and do not capture (discard)
then match any word that is or ends in 'friend' and capture it
In JavaScript:
const target = 'A boyfriend and girlfriend gained a friend when they asked to befriend them';
const pattern = /\b(?:girlfriend|(\w*friend))\b/g;
let result = [];
let arr;
while((arr=pattern.exec(target)) !== null){
if(arr[1]) {
result.push(arr[1]);
}
}
console.log(result);
which, when run, will print:
[ 'boyfriend', 'friend', 'befriend' ]
This may work:
\w*(?<!girl)friend
you could also try
\w*(?<!girl)friend\w* if you wanted to match words like befriended or boyfriends.
I'm not sure if (?<! is available in all regex flavors, but this expression worked in Expresso (which I believe uses .NET).
Try this:
/\b(?!girl)\w*friend\b/ig
I changed Rob Raisch's answer into a regexp that finds words containing a specific substring, but not also containing a different specific substring:
\b(?![\w_]*Unwanted[\w_]*)[\w_]*Desired[\w_]*\b
So for example \b(?![\w_]*mon[\w_]*)[\w_]*day[\w_]*\b will find every word with "day" (eg day , tuesday , daywalker ) in it, except if it also contains "mon" (eg monday)
Maybe useful for someone.
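The mon/day example can be checked with a short sketch, assuming Python's re module. The sample text is made up from the words mentioned above.

```python
import re

pattern = re.compile(r"\b(?![\w_]*mon[\w_]*)[\w_]*day[\w_]*\b")

text = "day tuesday daywalker monday"
found = pattern.findall(text)
print(found)  # ['day', 'tuesday', 'daywalker']
```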
In my case I needed to exclude words with a given prefix from the regex matches.
The text was a set of query-string params:
?=&sysNew=false&sysStart=true&sysOffset=4&Question=1
The prefix is sys, and I don't want the words that start with it.
The key to solving the issue was the word boundary \b:
\b(?!sys)\w+\b
Then I added that part to the bigger regex for the query string:
(\b(?!sys)\w+\b)=(\w+)
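Applied to the sample query string, the combined pattern behaves as follows. This is a sketch assuming Python's re module:

```python
import re

query = "?=&sysNew=false&sysStart=true&sysOffset=4&Question=1"
# Each match is a (key, value) pair whose key does not start with 'sys'.
pairs = re.findall(r"(\b(?!sys)\w+\b)=(\w+)", query)
print(pairs)  # [('Question', '1')]
```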