At the outset, let me explain that this question is neither about how to capture groups, nor about how to use quantifiers, two features of regex I am perfectly familiar with. It is more of an advanced question for regex lovers who may be familiar with unusual syntax in exotic engines.
Capturing Quantifiers
Does anyone know if a regex flavor allows you to capture quantifiers? By this, I mean that the number of characters matched by quantifiers such as + and * would be counted, and that this number could be used again in another quantifier.
For instance, suppose you wanted to make sure you have the same number of Ls and Rs in this kind of string: LLLRRRRR
You could imagine a syntax such as
L(+)R{\q1}
where the + quantifier for the L is captured, and where the captured number is referred to in the quantifier for the R as {\q1}
This would be useful to balance the number of {#,=,-,/} in strings such as
#### "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"
Relation to Recursion
In some cases quantifier capture would elegantly replace recursion, for instance a piece of text framed by the same number of Ls and Rs, a in
L(+) some_content R{\q1}
The idea is presented in some details on the following page: Captured Quantifiers
It also discusses a natural extension to captured quantifers: quantifier arithmetic, for occasions when you want to match (3*x + 1) the number of characters matched earlier.
I am trying to find out if anything like this exists.
Thanks in advance for your insights!!!
Update
Casimir gave a fantastic answer that shows two methods to validate that various parts of a pattern have the same length. However, I wouldn't want to rely on either of those for everyday work. These are really tricks that demonstrate great showmanship. In my mind, these beautiful but complex methods confirm the premise of the question: a regex feature to capture the number of characters that quantifers (such as + or *) are able to match would make such balancing patterns very simple and extend the syntax in a pleasingly expressive way.
Update 2 (much later)
I found out that .NET has a feature that comes close to what I was asking about. Added an answer to demonstrate the feature.
I don't know a regex engine that can capture a quantifier. However, it is possible with PCRE or Perl to use some tricks to check if you have the same number of characters. With your example:#### "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"you can check if # = - / are balanced with this pattern that uses the famous Qtax trick, (are you ready?): the "possessive-optional self-referencing group"
~(?<!#)((?:#(?=[^=]*(\2?+=)[^-]*(\3?+-)[^/]*(\4?+/)))+)(?!#)(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))~
pattern details:
~ # pattern delimiter
(?<!#) # negative lookbehind used as an # boundary
( # first capturing group for the #
(?:
# # one #
(?= # checks that each # is followed by the same number
# of = - /
[^=]* # all that is not an =
(\2?+=) # The possessive optional self-referencing group:
# capture group 2: backreference to itself + one =
[^-]*(\3?+-) # the same for -
[^/]*(\4?+/) # the same for /
) # close the lookahead
)+ # close the non-capturing group and repeat
) # close the first capturing group
(?!#) # negative lookahead used as an # boundary too.
# this checks the boundaries for all groups
(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))
~
The main idea
The non-capturing group contains only one #. Each time this group is repeated a new character is added in capture groups 2, 3 and 4.
the possessive-optional self-referencing group
How does it work?
( (?: # (?= [^=]* (\2?+ = ) .....) )+ )
At the first occurence of the # character the capture group 2 is not yet defined, so you can not write something like that (\2 =) that will make the pattern fail. To avoid the problem, the way is to make the backreference optional: \2?
The second aspect of this group is that the number of character = matched is incremented at each repetition of the non capturing group, since an = is added each time. To ensure that this number always increases (or the pattern fails), the possessive quantifier forces the backreference to be matched first before adding a new = character.
Note that this group can be seen like that: if group 2 exists then match it with the next =
( (?(2)\2) = )
The recursive way
~(?<!#)(?=(#(?>[^#=]+|(?-1))*=)(?!=))(?=(#(?>[^#-]+|(?-1))*-)(?!-))(?=(#(?>[^#/]+|(?-1))*/)(?!/))~
You need to use overlapped matches, since you will use the # part several times, it is the reason why all the pattern is inside lookarounds.
pattern details:
(?<!#) # left # boundary
(?= # open a lookahead (to allow overlapped matches)
( # open a capturing group
#
(?> # open an atomic group
[^#=]+ # all that is not an # or an =, one or more times
| # OR
(?-1) # recursion: the last defined capturing group (the current here)
)* # repeat zero or more the atomic group
= #
) # close the capture group
(?!=) # checks the = boundary
) # close the lookahead
(?=(#(?>[^#-]+|(?-1))*-)(?!-)) # the same for -
(?=(#(?>[^#/]+|(?-1))*/)(?!/)) # the same for /
The main difference with the precedent pattern is that this one doesn't care about the order of = - and / groups. (However you can easily make some changes to the first pattern to deal with that, with character classes and negative lookaheads.)
Note: For the example string, to be more specific, you can replace the negative lookbehind with an anchor (^ or \A). And if you want to obtain the whole string as match result you must add .* at the end (otherwise the match result will be empty as playful notices it.)
Coming back five weeks later because I learned that .NET has something that comes very close to the idea of "quantifier capture" mentioned in the question. The feature is called "balancing groups".
Here is the solution I came up with. It looks long, but it is quite simple.
(?:#(?<c1>)(?<c2>)(?<c3>))+[^#=]+(?<-c1>=)+[^=-]+(?<-c2>-)+[^-/]+(?<-c3>/)+[^/]+(?(c1)(?!))(?(c2)(?!))(?(c3)(?!))
How does it work?
The first non-capturing group matches the # characters. In that non-capturing group, we have three named groups c1, c2 and c3 that don't match anything, or rather, that match an empty string. These groups will serve as three counters c1, c2 and c3. Because .NET keeps track of intermediate captures when a group is quantified, every time an # is matched, a capture is added to the capture collections for Groups c1, c2 and c3.
Next, [^#=]+ eats up all the characters up to the first =.
The second quantified group (?<-c1>=)+ matches the = characters. That group seems to be named -c1, but -c1 is not a group name. -c1 is.NET syntax to pop one capture from the c1 group's capture collection into the ether. In other words, it allows us to decrement c1. If you try to decrement c1 when the capture collection is empty, the match fails. This ensures that we can never have more = than # characters. (Later, we'll have to make sure that we cannot have more # than = characters.)
The next steps repeat steps 2 and 3 for the - and / characters, decrementing counters c2 and c3.
The [^/]+ eats up the rest of the string.
The (?(c1)(?!)) is a conditional that says "If group c1 has been set, then fail". You may know that (?!) is a common trick to force a regex to fail. This conditional ensures that c1 has been decremented all the way to zero: in other words, there cannot be more # than = characters.
Likewise, the (?(c2)(?!)) and (?(c3)(?!)) ensure that there cannot be more # than - and / characters.
I don't know about you, but even this is a bit long, I find it really intuitive.
Related
I have 3 values that I'm trying to match. foo, bar and 123. However I would like to match them only if they can be matched twice.
In the following line:
foo;bar;123;foo;123;
since bar is not present twice, it would only match:
foo;bar;123;foo;123;
I understand how to specify to match exactly two matches, (foo|bar|123){2} however I need to use backreferences in order to make it work in my example.
I'm struggling putting the two concepts together and making a working solution for this.
You could use
(?<=^|;)([^\n;]+)(?=.*(?:(?<=^|;)\1(?=;|$)))
Broken down, this is
(?<=^|;) # pos. loobehind, either start of string or ;
([^\n;]+) # not ; nor newline 1+ times
(?=.* # pos. lookahead
(?:
(?<=^|;) # same pattern as above
\1 # group 1
(?=;|$) # end or ;
)
)
\b # word boundary
([^;]+) # anything not ; 1+ times
\b # another word boundary
(?=.*\1) # pos. lookahead, making sure the pattern is found again
See a demo on regex101.com.
Otherwise - as said in the comments - split on the ; programmatically and use some programming logic afterwards.
Find a demo in Python for example (can be adjusted for other languages as well):
from collections import Counter
string = """
foo;bar;123;foo;123;
foo;bar;foo;bar;
foo;foo;foo;bar;bar;
"""
twins = [element
for line in string.split("\n")
for element, times in Counter(line.split(";")).most_common()
if times == 2]
print(twins)
making sure to allow room for text that may occur in between matches with a ".*", this should match any of your values that occur at least twice:
(foo|bar|123).*\1
I am trying to understand how regex works. I understand it little by little. However, I don't understand this one completely. It's basically a regex for fully qualified domain names but a requirement is that the ending can't be .arpa.
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
https://regex101.com/r/hU6tP0/3
This doesn't match google.uk. If I change it to:
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{1,63}[^.arpa]$)
It works again.
But this works as well
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
Here is my thought process for
?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
I see it as this
(?=
Is a positive look ahead (Can someone explain to me what this actually means?) As I understand it now, it just means that the string needs to match the regex.
^.{4,253}$)
Match all characters but it needs to be between 4 and 253 characters long.
(^([a-zA-Z0-9]{1,63}\.)
Start a capture group and make another capture group within. This capture group says that every non special character can be written 1 to 63 times or till the . is written.
+
The previous capture group can be repeated indefinitely, but it should always end with a .. This way the next capture group is started.
[a-zA-Z]{2,63}
Then as many times as you want you can write a to z with upper, but it needs to be between 2 and 63.
[^.arpa]$)
The last characters can't be .arpa.
Can someone tell me where I am going wrong?
This doesn't do what you think it does:
[^.arpa]
All that says is 'ends with something that isn't one of the letter apr.' - it's a negated character class.
You might be thinking of a negative lookahead assertion:
(?!\.arpa)$
But if you're trying to compound multiple criteria in a regex, I'd suggest you're probably using the wrong tool for the job. It ends up complicated and hard to debug, thanks to greedy/non-greedy matching, etc.
Your 'positive/negative' lookaheads are to match a piece of a pattern that aren't surrounded by other pieces of pattern. But that can have some unexpected outcomes if you're matching variable widths, because the regex engine will backtrack until it finds something that matches.
A simpler example:
([\w.]+)(?!arpa)$
Applied to:
www.test.arpa
Will it match? What's in the group?
... it will match, because [\w\.]+ will consume all of it, and then the lookahead won't "see" anything.
If you use:
([\w]+)\.(?!arpa)
Instead though - you'll capture.... www, but you won't match test (with e.g. g flag, because the www doesn't have .arpa after it, but the test does.
https://regex101.com/r/hU6tP0/5
It really does get complicated using negative assertions in a pattern as a result. I'd suggest simply not doing so, and applying two separate tests. It's hard for you to figure out, and it's hard for a future maintenance programmer too!
This is an analysis of your regex:
(?=^.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Capturing Group 1 (CG1) - not needed
^ # Match start of the string
( # CG2 (can be a non capturing group '(?:...)')
[a-zA-Z0-9]{1,63} # any sequence of letters and numbers with length between 1 and 63
\. # a literal dot
)+ # CLOSE CG2
[a-zA-Z]{1,63} # any letter sequence with length between 1 to 63
[^.arpa] # a negated char class: any char that is not a "literal" '.','a','r','p' (last 'a' is redundant)
$ # end of the string
) # CLOSE CG1
To avoid the tail of the string to be .arpa you need to use a negative lookahead (?!...), so modify just like this:
(?=^.{4,253}$)(?!.*\.arpa$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
An online demo
Update:
I've upgraded the regex to rationalise it (i've incorporated also the Sobrique suggestion adding an important details):
/^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i
Compact version online demo
Legenda
/ # js regex delimiter
^ # start of the string
(?=.{4,253}$) # force min length: 4 chars, max length: 253 chars
(?: # Non capturing group 1 (NCG1)
[a-z0-9]{1,63} # any letter or digit in a sequence with length from 1 to 63 chars
[.] # a literal dot '.' (more readable than \.)
)+ # CLOSE NCG1 - repeat its content one or more time
(?!arpa$) # force that after the last literal dot '.' the string does not end with 'arpa' (i've added '$' to Sobrique suggestion instead it prevents also '.arpanet' too)
[a-z]{2,63} # a sequence of letters with length from 2 to 63
$ # end of the string
/i # Close the regex delimiter and add case insensitive flag [a-z] match also [A-Z] and viceversa
var re = /^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i;
var tests = ['google.uk','domain.arpa','domain.arpa2','another.domain.arpa.net','domain.arpanet'];
var m;
while(t = tests.pop()) {
document.getElementById("r").innerHTML += '"' + t + '"<br/>';
document.getElementById("r").innerHTML += 'Valid domain? ' + ( (t.match(re)) ? '<font color="green">YES</font>' : '<font color="red">NO</font>') + '<br/><br/>';
}
<div id="r"/>
I'm trying to create a regex to check the number of unique users.
In this case, 3 different users in 1 string means it's valid.
Let's say we have the following string
lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven
It contains the domain for each user (lab) and their first name.
Each user is seperated by ;
The goal is to have 3 unique users in a string.
In this case, the string is valid because we have the following unique users
simon, lieven, tim, davy = valid
If we take this string
lab\simon;lab\lieven;lab\simon
It's invalid because we only have 2 unique users
simon, lieven = invalid
So far, I've only come up with the following regex but I don't know how to continue
/(lab)\\(?:[a-zA-Z]*)/g
Could you help me with this regex?
Please let me know if you need more information if it's not clear.
What you are after cannot be achieved through regular expressions on their own. Regular expressions are to be used for parsing information and not processing.
There is no particular pattern you are after, which is what regular expression excel at. You will need to split by ; and use a data structure such as a set to store you string values.
Is this what you want:
1) Using regular expression:
import re
s = r'lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven'
pattern = re.compile(r'lab\\([A-z]{1,})')
user = re.findall(pattern, s)
if len(user) == len(set(user)) and len(user) >= 3:
print('Valid')
else:
print('Invalid')
2) Without using regular expression:
s = r'lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven'
users = [i.split('\\')[-1] for i in s.split(';')]
if len(users) == len(set(users)) and len(users) >= 3:
print('Valid')
else:
print('Invalid')
In order to have a successful match, we need at least 3 sets of lab\user, i.e:
(?:\\?lab\\[\w]+(?:;|$)){3}
You didn't specify your engine but with pythonyou can use:
import re
if re.search(r"(?:\\?lab\\[\w]+(?:;|$)){3}", string):
# Successful match
else:
# Match attempt failed
Regex Demo
Regex Explanation
(?:\\?lab\\[\w]+(?:;|$)){3}
Match the regular expression «(?:\\?lab\\[\w]+(?:;|$)){3}»
Exactly 3 times «{3}»
Match the backslash character «\\?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character string “lab” literally «lab»
Match the backslash character «\\»
Match a single character that is a “word character” «[\w]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regular expression below «(?:;|$)»
Match this alternative «;»
Match the character “;” literally «;»
Or match this alternative «$»
Assert position at the end of a line «$»
Here is a beginner-friendly way to solve your problem:
You should .split() the string per each "lab" section and declare the result as the array variable, like splitted_string.
Declare a second empty array to save each unique name, like unique_names.
Use a for loop to iterate through the splitted_string array. Check for unique names: if it isn't in your array of unique_names, add the name to unique_names.
Find the length of your array of unique_names to see if it is equal to 3. If yes, print that it is. If not, then print a fail message.
You seem like a practical person that is relatively new to string manipulation. Maybe you would enjoy some practical background reading on string manipulation at beginner sites like Automate The Boring Stuff With Python:
https://automatetheboringstuff.com/chapter6/
Or Codecademy, etc.
Another pure regex answer for the sport. As other said, you should probably not be doing this
^([^;]+)(;\1)*;((?!\1)[^;]+)(;(\1|\3))*;((?!\1|\3)[^;]+)
Explanation :
^ from the start of the string
([^;]+) we catch everything that isn't a ';'.
that's our first user, and our first capturing group
(;\1)* it could be repeated
;((?!\1)[^;]+) but at some point, we want to capture everything that isn't either
our first user nor a ';'. That's our second user,
and our third capturing group
(;(\1|\3))* both the first and second user can be repeated now
;((?!\1|\3)[^;]+) but at some point, we want to capture yada yada,
our third user and fifth capturing group
This can be done with a simple regex.
Uses a conditional for each user name slot so that the required
three names are obtained.
Note that since the three slots are in a loop, the conditional guarantees the
capture group is not overwritten (which would invalidate the below mentioned
assertion test (?! \1 | \2 | \3 ).
There is a complication. Each user name uses the same regex [a-zA-Z]+
so to accommodate that, a function is defined to check that the slot
has not been matched before.
This is using the boost engine, that cosmetically requires the group be
defined before it is back referenced.
The workaround is to define a function at the bottom after the group is defined.
In PERL (and some other engines) it is not required to define a group ahead
of time before its back referenced, so you could do away with the function
and put
(?! \1 | \2 | \3 ) # Cannot have seen this user
[a-zA-Z]+
in the capture groups on top.
At a minimum, this requires conditionals.
Formatted and tested:
# (?:(?:.*?\blab\\(?:((?(1)(?!))(?&GetUser))|((?(2)(?!))(?&GetUser))|((?(3)(?!))(?&GetUser))))){3}(?(DEFINE)(?<GetUser>(?!\1|\2|\3)[a-zA-Z]+))
# Look for 3 unique users
(?:
(?:
.*?
\b lab \\
(?:
( # (1), User 1
(?(1) (?!) )
(?&GetUser)
)
| ( # (2), User 2
(?(2) (?!) )
(?&GetUser)
)
| ( # (3), User 3
(?(3) (?!) )
(?&GetUser)
)
)
)
){3}
(?(DEFINE)
(?<GetUser> # (4)
(?! \1 | \2 | \3 ) # Cannot have seen this user
[a-zA-Z]+
)
)
In a text editor, I want to replace a given word with the number of the line number on which this word is found. Is this is possible with Regex?
Recursion, Self-Referencing Group (Qtax trick), Reverse Qtax or Balancing Groups
Introduction
The idea of adding a list of integers to the bottom of the input is similar to a famous database hack (nothing to do with regex) where one joins to a table of integers. My original answer used the #Qtax trick. The current answers use either Recursion, the Qtax trick (straight or in a reversed variation), or Balancing Groups.
Yes, it is possible... With some caveats and regex trickery.
The solutions in this answer are meant as a vehicle to demonstrate some regex syntax more than practical answers to be implemented.
At the end of your file, we will paste a list of numbers preceded with a unique delimiter. For this experiment, the appended string is :1:2:3:4:5:6:7 This is a similar technique to a famous database hack that uses a table of integers.
For the first two solutions, we need an editor that uses a regex flavor that allows recursion (solution 1) or self-referencing capture groups (solutions 2 and 3). Two come to mind: Notepad++ and EditPad Pro. For the third solution, we need an editor that supports balancing groups. That probably limits us to EditPad Pro or Visual Studio 2013+.
Input file:
Let's say we are searching for pig and want to replace it with the line number.
We'll use this as input:
my cat
dog
my pig
my cow
my mouse
:1:2:3:4:5:6:7
First Solution: Recursion
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested).
The recursive structure lives in a lookahead, and is optional. Its job is to balance lines that don't contain pig, on the left, with numbers, on the right: think of it as balancing a nested construct like {{{ }}}... Except that on the left we have the no-match lines, and on the right we have the numbers. The point is that when we exit the lookahead, we know how many lines were skipped.
Search:
(?sm)(?=.*?pig)(?=((?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?:(?1)|[^:]+)(:\d+))?).*?\Kpig(?=.*?(?(2)\2):(\d+))
Free-Spacing Version with Comments:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # fail right away if pig isn't there
(?= # The Recursive Structure Lives In This Lookahead
( # Group 1
(?: # skip one line
^
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
(?:(?1)|[^:]+) # recurse Group 1 OR match all chars that are not a :
(:\d+) # match digits
)? # End Group
) # End lookahead.
.*?\Kpig # get to pig
(?=.*?(?(2)\2):(\d+)) # Lookahead: capture the next digits
Replace: \3
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Second Solution: Group that Refers to Itself ("Qtax Trick")
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested). The solution is easy to adapt to .NET by converting the \K to a lookahead and the possessive quantifier to an atomic group (see the .NET Version a few lines below.)
Search:
(?sm)(?=.*?pig)(?:(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*+.*?\Kpig(?=[^:]+(?(1)\1):(\d+))
.NET version: Back to the Future
.NET does not have \K. It its place, we use a "back to the future" lookbehind (a lookbehind that contains a lookahead that skips ahead of the match). Also, we need to use an atomic group instead of a possessive quantifier.
(?sm)(?<=(?=.*?pig)(?=(?>(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*).*)pig(?=[^:]+(?(1)\1):(\d+))
Free-Spacing Version with Comments (Perl / PCRE Version):
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# for each line skipped, let Group 1 match an ever increasing portion of the numbers string at the bottom
(?= # lookahead
[^:]+ # skip all chars that are not colons
( # start Group 1
(?(1)\1) # match Group 1 if set
:\d+ # match a colon and some digits
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop everything we've matched so far
pig # match pig (this is the match!)
(?=[^:]+(?(1)\1):(\d+)) # capture the next number to Group 2
Replace:
\2
Output:
my cat
dog
my 3
my cow
my mouse
:1:2:3:4:5:6:7
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Choice of Delimiter for Digits
In our example, the delimiter : for the string of digits is rather common, and could happen elsewhere. We can invent a UNIQUE_DELIMITER and tweak the expression slightly. But the following optimization is even more efficient and lets us keep the :
Optimization on Second Solution: Reverse String of Digits
Instead of pasting our digits in order, it may be to our benefit to use them in the reverse order: :7:6:5:4:3:2:1
In our lookaheads, this allows us to get down to the bottom of the input with a simple .*, and to start backtracking from there. Since we know we're at the end of the string, we don't have to worry about the :digits being part of another section of the string. Here's how to do it.
Input:
my cat pi g
dog p ig
my pig
my cow
my mouse
:7:6:5:4:3:2:1
Search:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line that doesn't have pig
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# Group 1 matches increasing portion of the numbers string at the bottom
(?= # lookahead
.* # get to the end of the input
( # start Group 1
:\d+ # match a colon and some digits
(?(1)\1) # match Group 1 if set
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop match so far
pig # match pig (this is the match!)
(?=.*(\d+)(?(1)\1)) # capture the next number to Group 2
Replace: \2
See the substitutions in the demo.
Third Solution: Balancing Groups
This solution is specific to .NET.
Search:
(?m)(?<=\A(?<c>^(?:(?!pig)[^\r\n])*(?:\r?\n))*.*?)pig(?=[^:]+(?(c)(?<-c>:\d+)*):(\d+))
Free-Spacing Version with Comments:
(?xm) # free-spacing, multi-line
(?<= # lookbehind
\A #
(?<c> # skip one line that doesn't have pig
# The length of Group c Captures will serve as a counter
^ # beginning of line
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
) # end skipper
* # repeat skipper
.*? # we're on the pig line: lazily match chars before pig
) # end lookbehind
pig # match pig: this is the match
(?= # lookahead
[^:]+ # get to the digits
(?(c) # if Group c has been set
(?<-c>:\d+) # decrement c while we match a group of digits
* # repeat: this will only repeat as long as the length of Group c captures > 0
) # end if Group c has been set
:(\d+) # Match the next digit group, capture the digits
) # end lokahead
Replace: $1
Reference
Qtax trick
On Which Line Number Was the Regex Match Found?
Because you didn't specify which text editor, in vim it would be:
:%s/searched_word/\=printf('%-4d', line('.'))/g (read more)
But as somebody mentioned it's not a question for SO but rather Super User ;)
I don't know of an editor that does that short of extending an editor that allows arbitrary extensions.
You could easily use perl to do the task, though.
perl -i.bak -e"s/word/$./eg" file
Or if you want to use wildcards,
perl -MFile::DosGlob=glob -i.bak -e"BEGIN { #ARGV = map glob($_), #ARGV } s/word/$./eg" *.txt
This is a crossword problem. Example:
the solution is a 6-letter word which starts with "r" and ends with "r"
thus the pattern is "r....r"
the unknown 4 letters must be drawn from the pool of letters "a", "e", "i" and "p"
each letter must be used exactly once
we have a large list of candidate 6-letter words
Solutions: "rapier" or "repair".
Filtering for the pattern "r....r" is trivial, but finding words which also have [aeip] in the "unknown" slots is beyond me.
Is this problem amenable to a regex, or must it be done by exhaustive methods?
Try this:
r(?:(?!\1)a()|(?!\2)e()|(?!\3)i()|(?!\4)p()){4}r
...or more readably:
r
(?:
(?!\1) a () |
(?!\2) e () |
(?!\3) i () |
(?!\4) p ()
){4}
r
The empty groups serve as check marks, ticking off each letter as it's consumed. For example, if the word to be matched is repair, the e will be the first letter matched by this construct. If the regex tries to match another e later on, that alternative won't match it. The negative lookahead (?!\2) will fail because group #2 has participated in the match, and never mind that it didn't consume anything.
What's really cool is that it works just as well on strings that contain duplicate letters. Take your redeem example:
r
(?:
(?!\1) e () |
(?!\2) e () |
(?!\3) e () |
(?!\4) d ()
){4}
m
After the first e is consumed, the first alternative is effectively disabled, so the second alternative takes it instead. And so on...
Unfortunately, this technique doesn't work in all regex flavors. For one thing, they don't all treat empty/failed group captures the same. The ECMAScript spec explicitly states that references to non-participating groups should always succeed.
The regex flavor also has to support forward references--that is, backreferences that appear before the groups they refer to in the regex. (ref) It should work in .NET, Java, Perl, PCRE and Ruby, that I know of.
Assuming that you meant that the unknown letters must be among [aeip], then a suitable regex could be:
/r[aeip]{4,4}r/
What's the front end language being used to compare strings, is it java, .net ...
here is an example/psuedo code using java
String mandateLetters = "aeio"
String regPattern = "\\br["+mandateLetters+"]*r$"; // or if for specific length \\br[+mandateLetters+]{4}r$
Pattern pattern = Pattern.compile(regPattern);
Matcher matcher = pattern.matcher("is this repair ");
matcher.find();
Why not replace each '.' in your original pattern with '[aeip]'?
You'd wind up with a regex string r[aeip][aeip][aeip][aeip]r.
This could of course be shortened to r[aeip]{4,4}r, but that would be a pain to implement in the general case and probably wouldn't improve the code any.
This doesn't address the issue of duplicate letter use. If I were coding it, I'd handle that in code outside the regexp - mostly because the regexp would get more complicated than I'd care to handle.
So the "only once" part is the critical thing. Listing all permutations is obviously not feasible. If your language/environment supports lookaheads and backreferences you can make it a bit easier for yourself:
r(?=[aeip]{4,4})(.)(?!\1)(.)(?!\1|\2)(.)(?!\1|\2|\3).r
Still quite ugly, but here is how it works:
r # match an r
(?= # positive lookahead (doesn't advance position of "cursor" in input string)
[aeip]{4,4}
) # make sure that there are the four desired character ahead
(.) # match any character and capture it in group 1
(?!\1)# make sure that the next character is NOT the same as the previous one
(.) # match any character and capture it in group 2
(?!\1|\2)
# make sure that the next character is neither the first nor the second
(.) # match any character and capture it in group 3
(?!\1|\2|\3)
# same thing again for all three characters
. # match another arbitrary character
r # match an r
Working demo.
This is neither really elegant nor scalable. So you might just want to use r([aiep]{4,4})r (capturing the four critical letters) and ensure the additional condition without regex.
EDIT: In fact, the above pattern is only really useful and necessary if you just want to ensure that you have 4 non-identical characters. For your specific case, again using lookaheads, there is simpler (despite longer) solution:
r(?=[^a]*a[^a]*r)(?=[^e]*e[^e]*r)(?=[^i]*i[^i]*r)(?=[^p]*p[^p]*r)[aeip]{4,4}r
Explained:
r # match an r
(?= # lookahead: ensure that there is exactly one a until the next r
[^a]* # match an arbitrary amount of non-a characters
a # match one a
[^a]* # match an arbitrary amount of non-a characters
r # match the final r
) # end of lookahead
(?=[^e]*e[^e]*r) # ensure that there is exactly one e until the next r
(?=[^i]*i[^i]*r) # ensure that there is exactly one i until the next r
(?=[^p]*p[^p]*r) # ensure that there is exactly one p until the next r
[aeip]{4,4}r # actually match the rest to include it in the result
Working demo.
For r....m with a pool of deee, this could be adjusted as:
r(?=[^d]*d[^d]*m)(?=[^e]*(?:e[^e])*{3,3}m)[de]{4,4}m
This ensures that there is exactly one d and exactly 3 es.
Working demo.
not fully regex due to sed multi regex action
sed -n -e '/^r[aiep]\{4,\}r$/{/\([aiep]\).*\1/!p;}' YourFile
take pattern 4 letter in group aeipsourround by r, keep only line where no letter in the sub group is found twice.
A more scalable solution (no need to write \1, \2, \3 and so on for each letter or position) is to use negative lookahead to assert that each character is not occurring later:
^r(?:([aeip])(?!.*\1)){4}r$
more readable as:
^r
(?:
([aeip])
(?!.*\1)
){4}
r$
Improvements
This was a quick solution which works in the situation you gave us, but here are some additional constraints to have a robuster version:
If the "pool of letters" may share some letters with the end of string, include the end of pattern in the lookahead:
^r(?:([aeip])(?!.*\1.*\2)){4}(r$)
(may not work as intended in all regex flavors, in which case copy-paste the end of pattern instead of using \2)
If some letters must be present not only once but a different fixed number of times, add a separate lookahead for all letters sharing this number of times. For instance, "r....r" with one "a" and one "p" but two "e" would be matched by this regex (but "rapper" and "repeer" wouldn't):
^r(?:([ap])(?!.*\1.*\3)|([e])(?!.*\2.*\2.*\3)){4}(r$)
The non-capturing groups now has 2 alternatives: ([ap])(?!.*\1.*\3) which matches "a" or "p" not followed anywhere until ending by another one, and ([e])(?!.*\2.*\2.*\3) which matches "e" not followed anywhere until ending by 2 other ones (so it fails on the first one if there are 3 in total).
BTW this solution includes the above one, but the end of pattern is here shifted to \3 (also, cf. note about flavors).