This is the second part of a series of educational regex articles. It shows how lookaheads and nested references can be used to match the non-regular language a^n b^n. Nested references are first introduced in: How does this regex find triangular numbers?
One of the archetypal non-regular languages is:
L = { a^n b^n : n > 0 }
This is the language of all non-empty strings consisting of some number of a's followed by an equal number of b's. Examples of strings in this language are ab, aabb, aaabbb.
This language can be shown to be non-regular by the pumping lemma. It is in fact an archetypal context-free language, which can be generated by the context-free grammar S → aSb | ab.
Nonetheless, modern-day regex implementations clearly recognize more than just regular languages. That is, they are not "regular" by the formal language theory definition. PCRE and Perl support recursive regex, and .NET supports balancing group definitions. Even less "fancy" features, e.g. backreference matching, mean that regex is not regular.
But just how powerful are these "basic" features? Can we recognize L with Java regex, for example? Can we perhaps combine lookarounds and nested references and have a pattern that works with e.g. String.matches to match strings like ab, aabb, aaabbb, etc?
References
perlfaq6: Can I use Perl regular expressions to match balanced text?
MSDN - Regular Expression Language Elements - Balancing Group Definitions
pcre.org - PCRE man page
regular-expressions.info - Lookarounds and Grouping and Backreferences
java.util.regex.Pattern
Linked questions
Does lookaround affect which languages can be matched by regular expressions?
.NET Regex Balancing Groups vs PCRE Recursive Patterns
The answer is, needless to say, YES! You can most certainly write a Java regex pattern to match a^n b^n. It uses a positive lookahead for assertion, and one nested reference for "counting".
Rather than immediately giving out the pattern, this answer will guide readers through the process of deriving it. Various hints are given as the solution is slowly constructed. In this aspect, hopefully this answer will contain much more than just another neat regex pattern. Hopefully readers will also learn how to "think in regex", and how to put various constructs harmoniously together, so they can derive more patterns on their own in the future.
The language used to develop the solution will be PHP for its conciseness. The final test once the pattern is finalized will be done in Java.
Step 1: Lookahead for assertion
Let's start with a simpler problem: we want to match a+ at the beginning of a string, but only if it's followed immediately by b+. We can use ^ to anchor our match, and since we only want to match the a+ without the b+, we can use lookahead assertion (?=…).
Here is our pattern with a simple test harness:
function testAll($r, $tests) {
    foreach ($tests as $test) {
        $isMatch = preg_match($r, $test, $groups);
        $groupsJoined = join('|', $groups);
        print("$test $isMatch $groupsJoined\n");
    }
}
$tests = array('aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb');
$r1 = '/^a+(?=b+)/';
#          └────┘
#         lookahead
testAll($r1, $tests);
The output is (as seen on ideone.com):
aaa 0
aaab 1 aaa
aaaxb 0
xaaab 0
b 0
abbb 1 a
This is exactly the output we want: we match a+, only if it's at the beginning of the string, and only if it's immediately followed by b+.
Lesson: You can use patterns in lookarounds to make assertions.
Step 2: Capturing in a lookahead (and f r e e - s p a c i n g mode)
Now let's say that even though we don't want the b+ to be part of the match, we do want to capture it anyway into group 1. Also, as we anticipate having a more complicated pattern, let's use x modifier for free-spacing so we can make our regex more readable.
Building on our previous PHP snippet, we now have the following pattern:
$r2 = '/ ^ a+ (?= (b+) ) /x';
#             │   └──┘ │
#             │     1  │
#             └────────┘
#              lookahead
testAll($r2, $tests);
The output is now (as seen on ideone.com):
aaa 0
aaab 1 aaa|b
aaaxb 0
xaaab 0
b 0
abbb 1 a|bbb
Note that e.g. aaa|b is the result of join-ing what each group captured with '|'. In this case, group 0 (i.e. what the pattern matched) captured aaa, and group 1 captured b.
Lesson: You can capture inside a lookaround. You can use free-spacing to enhance readability.
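For readers who prefer to experiment outside PHP, here is a minimal sketch of the same Step 2 pattern in Python. It assumes that Python's built-in re module treats this particular pattern the same way PCRE does; the helper name describe is made up for this illustration and only mimics the join('|', $groups) output of the PHP harness.

```python
import re

# re.VERBOSE plays the role of PCRE's x modifier (free-spacing mode).
r2 = re.compile(r"^ a+ (?= (b+) )", re.VERBOSE)

def describe(pattern, text):
    """Hypothetical helper: 'group0|group1' on a match, None otherwise."""
    m = pattern.search(text)
    if m is None:
        return None
    return "|".join((m.group(0),) + m.groups())

for test in ["aaa", "aaab", "aaaxb", "xaaab", "b", "abbb"]:
    print(test, describe(r2, test))
```

Group 0 never contains a b, because the lookahead consumes nothing, yet group 1 still records what the lookahead saw.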
Step 3: Refactoring the lookahead into the "loop"
Before we can introduce our counting mechanism, we need to do one modification to our pattern. Currently, the lookahead is outside of the + repetition "loop". This is fine so far because we just wanted to assert that there's a b+ following our a+, but what we really want to do eventually is assert that for each a that we match inside the "loop", there's a corresponding b to go with it.
Let's not worry about the counting mechanism for now and just do the refactoring as follows:
First refactor a+ to (?: a )+ (note that (?:…) is a non-capturing group)
Then move the lookahead inside this non-capturing group
Note that we must now "skip" a* before we can "see" the b+, so modify the pattern accordingly
So we now have the following:
$r3 = '/ ^ (?: a (?= a* (b+) ) )+ /x';
#          │     │      └──┘ │ │
#          │     │        1  │ │
#          │     └───────────┘ │
#          │       lookahead   │
#          └───────────────────┘
#           non-capturing group
The output is the same as before (as seen on ideone.com), so there's no change in that regard. The important thing is that now we are making the assertion at every iteration of the + "loop". With our current pattern, this is not necessary, but next we'll make group 1 "count" for us using self-reference.
Lesson: You can capture inside a non-capturing group. Lookarounds can be repeated.
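The claim that the refactoring changes nothing can be checked mechanically. The sketch below ports the Step 3 pattern to Python's re module (an assumption: nothing in this pattern is PCRE-specific) and compares it against the match and group 1 values listed in the Step 2 output. The names r3, expected and capture are illustrative only.

```python
import re

# Step 3 pattern: the lookahead now runs once per iteration of the + "loop".
r3 = re.compile(r"^ (?: a (?= a* (b+) ) )+", re.VERBOSE)

# (group 0, group 1) pairs taken from the Step 2 output; None means "no match".
expected = {
    "aaa": None,
    "aaab": ("aaa", "b"),
    "aaaxb": None,
    "xaaab": None,
    "b": None,
    "abbb": ("a", "bbb"),
}

def capture(pattern, text):
    m = pattern.search(text)
    return None if m is None else (m.group(0), m.group(1))

unchanged = all(capture(r3, t) == want for t, want in expected.items())
print(unchanged)
```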
Step 4: This is the step where we start counting
Here's what we're going to do: we'll rewrite group 1 such that:
At the end of the first iteration of the +, when the first a is matched, it should capture b
At the end of the second iteration, when another a is matched, it should capture bb
At the end of the third iteration, it should capture bbb
...
At the end of the n-th iteration, group 1 should capture b^n
If there aren't enough b to capture into group 1 then the assertion simply fails
So group 1, which is now (b+), will have to be rewritten to something like (\1 b). That is, we try to "add" a b to what group 1 captured in the previous iteration.
There's a slight problem here in that this pattern is missing the "base case", i.e. the case where it can match without the self-reference. A base case is required because group 1 starts "uninitialized"; it hasn't captured anything yet (not even an empty string), so a self-reference attempt will always fail.
There are many ways around this, but for now let's just make the self-reference matching optional, i.e. \1?. This may or may not work perfectly, but let's just see what that does, and if there's any problem then we'll cross that bridge when we come to it. Also, we'll add some more test cases while we're at it.
$tests = array(
    'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb'
);
$r4 = '/ ^ (?: a (?= a* (\1? b) ) )+ /x';
#          │     │      └─────┘ │ │
#          │     │         1    │ │
#          │     └──────────────┘ │
#          │        lookahead     │
#          └──────────────────────┘
#            non-capturing group
The output is now (as seen on ideone.com):
aaa 0
aaab 1 aaa|b # (*gasp!*)
aaaxb 0
xaaab 0
b 0
abbb 1 a|b # yes!
aabb 1 aa|bb # YES!!
aaabbbbb 1 aaa|bbb # YESS!!!
aaaaabbb 1 aaaaa|bb # NOOOOOoooooo....
A-ha! It looks like we're really close to the solution now! We managed to get group 1 to "count" using self-reference! But wait... something is wrong with the second and the last test cases!! There aren't enough bs, and somehow it counted wrong! We'll examine why this happened in the next step.
Lesson: One way to "initialize" a self-referencing group is to make the self-reference matching optional.
Step 4½: Understanding what went wrong
The problem is that since we made the self-reference matching optional, the "counter" can "reset" back to 0 when there aren't enough b's. Let's closely examine what happens at every iteration of our pattern with aaaaabbb as input.
 a a a a a b b b
↑
# Initial state: Group 1 is "uninitialized".
           _
 a a a a a b b b
  ↑
# 1st iteration: Group 1 couldn't match \1 since it was "uninitialized",
#                so it matched and captured just b
           ___
 a a a a a b b b
    ↑
# 2nd iteration: Group 1 matched \1b and captured bb
           _____
 a a a a a b b b
      ↑
# 3rd iteration: Group 1 matched \1b and captured bbb
           _
 a a a a a b b b
        ↑
# 4th iteration: Group 1 could still match \1, but not \1b,
#  (!!!)         so it matched and captured just b
           ___
 a a a a a b b b
          ↑
# 5th iteration: Group 1 matched \1b and captured bb
#
# No more a, + "loop" terminates
A-ha! On our 4th iteration, we could still match \1, but we couldn't match \1b! Since we allow the self-reference matching to be optional with \1?, the engine backtracks and takes the "no thanks" option, which then allows us to match and capture just b!
Do note, however, that except on the very first iteration, you could always match just the self-reference \1. This is obvious, of course, since it's what we just captured on our previous iteration, and in our setup we can always match it again (e.g. if we captured bbb last time, we're guaranteed that there will still be bbb, but there may or may not be bbbb this time).
Lesson: Beware of backtracking. The regex engine will do as much backtracking as you allow until the given pattern matches. This may impact performance (i.e. catastrophic backtracking) and/or correctness.
Step 5: Self-possession to the rescue!
The "fix" should now be obvious: combine optional repetition with possessive quantifier. That is, instead of simply ?, use ?+ instead (remember that a repetition that is quantified as possessive does not backtrack, even if such "cooperation" may result in a match of the overall pattern).
In very informal terms, this is what ?+, ? and ?? say:
?+
    (optional) "It doesn't have to be there,"
        (possessive) "but if it is there, you must take it and not let go!"
?
    (optional) "It doesn't have to be there,"
        (greedy) "but if it is you can take it for now,"
            (backtracking) "but you may be asked to let it go later!"
??
    (optional) "It doesn't have to be there,"
        (reluctant) "and even if it is you don't have to take it just yet,"
            (backtracking) "but you may be asked to take it later!"
In our setup, \1 will not be there the very first time, but it will always be there any time after that, and we always want to match it then. Thus, \1?+ would accomplish exactly what we want.
$r5 = '/ ^ (?: a (?= a* (\1?+ b) ) )+ /x';
#          │     │      └──────┘ │ │
#          │     │         1     │ │
#          │     └───────────────┘ │
#          │        lookahead      │
#          └───────────────────────┘
#             non-capturing group
Now the output is (as seen on ideone.com):
aaa 0
aaab 1 a|b # Yay! Fixed!
aaaxb 0
xaaab 0
b 0
abbb 1 a|b
aabb 1 aa|bb
aaabbbbb 1 aaa|bbb
aaaaabbb 1 aaa|bbb # Hurrahh!!!
Voilà!!! Problem solved!!! We are now counting properly, exactly the way we want it to!
Lesson: Learn the difference between greedy, reluctant, and possessive repetition. Optional-possessive can be a powerful combination.
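To make the difference between \1? and \1?+ concrete without relying on any particular engine, here is a small hand-written model of the "loop" in Python. It is only a sketch of the reasoning in Steps 4½ and 5, not a regex engine: the function simulate and its possessive flag are invented for this illustration, and the model assumes the input starts with a run of a's so that the lookahead always inspects the text right after that run. (Python's built-in re module is deliberately not used; as far as I know it rejects a backreference to a group that is still open.)

```python
def simulate(text, possessive):
    """Model of  ^ (?: a (?= a* (\\1? b) ) )+  (or \\1?+ when possessive is True).

    Returns (number of a's the + loop matched, final content of group 1).
    """
    n_a = len(text) - len(text.lstrip("a"))  # length of the leading run of a's
    rest = text[n_a:]                         # what the lookahead sees after a*
    group1 = None                             # "uninitialized"
    matched = 0
    for _ in range(n_a):
        if group1 is None:
            candidates = ["b"]                # base case: \1 cannot match yet
        elif possessive:
            candidates = [group1 + "b"]       # \1?+ keeps \1, no second chance
        else:
            candidates = [group1 + "b", "b"]  # \1? may backtrack and drop \1
        for candidate in candidates:
            if rest.startswith(candidate):
                group1 = candidate
                matched += 1
                break
        else:
            break                             # lookahead failed: the loop stops
    return matched, group1

print(simulate("aaaaabbb", possessive=False))  # counter "resets" on the 4th a
print(simulate("aaaaabbb", possessive=True))   # counter never resets
```

With possessive=False the model reproduces the aaaaa|bb result from Step 4, and with possessive=True it reproduces the aaa|bbb result from Step 5.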
Step 6: Finishing touches
So what we have right now is a pattern that matches a repeatedly, and for every a that was matched, there is a corresponding b captured in group 1. The + terminates when there are no more a, or if the assertion failed because there isn't a corresponding b for an a.
To finish the job, we simply need to append \1 $ to our pattern. This is now a backreference to what group 1 matched, followed by the end-of-line anchor. The anchor ensures that there aren't any extra b's in the string; in other words, that we in fact have a^n b^n.
Here's the finalized pattern, with additional test cases, including one that's 10,000 characters long:
$tests = array(
    'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb',
    '', 'ab', 'abb', 'aab', 'aaaabb', 'aaabbb', 'bbbaaa', 'ababab', 'abc',
    str_repeat('a', 5000).str_repeat('b', 5000)
);
$r6 = '/ ^ (?: a (?= a* (\1?+ b) ) )+ \1 $ /x';
#          │     │      └──────┘ │ │
#          │     │         1     │ │
#          │     └───────────────┘ │
#          │        lookahead      │
#          └───────────────────────┘
#             non-capturing group
It finds 4 matches: ab, aabb, aaabbb, and a^5000 b^5000. It takes only 0.06s to run on ideone.com.
Step 7: The Java test
So the pattern works in PHP, but the ultimate goal is to write a pattern that works in Java.
public class Main {
    public static void main(String[] args) {
        // No ^ or $ needed: String.matches only succeeds on the entire input.
        String aNbN = "(?x) (?: a (?= a* (\\1?+ b)) )+ \\1";
        String[] tests = {
            "",      // false
            "ab",    // true
            "abb",   // false
            "aab",   // false
            "aabb",  // true
            "abab",  // false
            "abc",   // false
            repeat('a', 5000) + repeat('b', 4999), // false
            repeat('a', 5000) + repeat('b', 5000), // true
            repeat('a', 5000) + repeat('b', 5001), // false
        };
        for (String test : tests) {
            System.out.printf("[%s]%n  %s%n%n", test, test.matches(aNbN));
        }
    }

    static String repeat(char ch, int n) {
        return new String(new char[n]).replace('\0', ch);
    }
}
The pattern works as expected (as seen on ideone.com).
And now we come to the conclusion...
It needs to be said that the a* in the lookahead, and indeed the "main + loop", both permit backtracking. Readers are encouraged to confirm why this is not a problem in terms of correctness, and why at the same time making both possessive would also work (though perhaps mixing mandatory and non-mandatory possessive quantifier in the same pattern may lead to misperceptions).
It should also be said that while it's neat that there's a regex pattern that will match a^n b^n, this is not always the "best" solution in practice. A much better solution is to simply match ^(a+)(b+)$, and then compare the length of the strings captured by groups 1 and 2 in the hosting programming language.
In PHP, it may look something like this (as seen in ideone.com):
function is_anbn($s) {
    return (preg_match('/^(a+)(b+)$/', $s, $groups)) &&
        (strlen($groups[1]) == strlen($groups[2]));
}
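The same "match the shape, then count in the host language" idea can be sketched in Python. This is an illustrative port rather than part of the original test suite; it uses fullmatch instead of ^ and $ so that a trailing newline is not silently accepted.

```python
import re

# Shape check only: some a's followed by some b's.
_SHAPE = re.compile(r"(a+)(b+)")

def is_anbn(s):
    """True when s is n a's followed by exactly n b's, for some n > 0."""
    m = _SHAPE.fullmatch(s)
    return m is not None and len(m.group(1)) == len(m.group(2))

print([t for t in ["", "ab", "abb", "aab", "aabb", "abab", "abc"] if is_anbn(t)])
```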
The purpose of this article is NOT to convince readers that regex can do almost anything; it clearly can't, and even for the things it can do, at least partial delegation to the hosting language should be considered if it leads to a simpler solution.
As mentioned at the top, while this article is necessarily tagged [regex] for stackoverflow, it is perhaps about more than that. While certainly there's value in learning about assertions, nested references, possessive quantifier, etc, perhaps the bigger lesson here is the creative process by which one can try to solve problems, the determination and hard work that it often requires when you're subjected to various constraints, the systematic composition from various parts to build a working solution, etc.
Bonus material! PCRE recursive pattern!
Since we did bring up PHP, it needs to be said that PCRE supports recursive patterns and subroutines. Thus, the following pattern works for preg_match (as seen on ideone.com):
$rRecursive = '/ ^ (a (?1)? b) $ /x';
Currently Java's regex does not support recursive pattern.
Even more bonus material! Matching a^n b^n c^n !!
So we've seen how to match a^n b^n, which is non-regular but still context-free. But can we also match a^n b^n c^n, which isn't even context-free?
The answer is, of course, YES! Readers are encouraged to try to solve this on their own, but the solution is provided below (with implementation in Java on ideone.com).
^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $
Given that no mention has been made of PCRE supporting recursive patterns, I'd just like to point out the simplest and most efficient example of PCRE that describes the language in question:
/^(a(?1)?b)$/
As mentioned in the question, with .NET balancing groups, patterns of the type a^n b^n c^n d^n … z^n can be matched easily as
^
(?<A>a)+
(?<B-A>b)+ (?(A)(?!))
(?<C-B>c)+ (?(B)(?!))
...
(?<Z-Y>z)+ (?(Y)(?!))
$
For example: http://www.ideone.com/usuOE
Edit:
There is also a PCRE pattern for the generalized language with recursive pattern, but a lookahead is needed. I don't think this is a direct translation of the above.
^
(?=(a(?-1)?b)) a+
(?=(b(?-1)?c)) b+
...
(?=(x(?-1)?y)) x+
(y(?-1)?z)
$
For example: http://www.ideone.com/9gUwF
Regex - should match newlines as well as should end at the first occurrence of a particular format
In reference to Regex - should match newlines as well as should end at the first occurrence of a particular format
I am trying to read body of the mail from logs (some of them are more than 500 lines).
Sample data looks like: BodyOftheMail_Script = [ BEGIN 500 lines END ]
I've tried following regular expressions:
+------------------------------------------------------------------------+----------+--------+
| Regexp                                                                 | Steps    | Time   |
+------------------------------------------------------------------------+----------+--------+
| BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s)[\s\S]*?)(?=\s{1,}END\s]) | 1015862  | ~474ms |
| BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s)[\w\W]*?)(?=\s{1,}END\s]) | 1015862  | ~480ms |
| BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s).*?)(?=\s{1,}END\s])      | 1015862  | ~577ms |
| BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((.|\n)*?)(?=\s{1,}END\s\])    | 1681711  | ~829ms |
+------------------------------------------------------------------------+----------+--------+
Is there a faster way (more optimal regexp) to match this?
Enhancing the pattern
The most efficient of the 5 expressions turned out to be
BodyOftheMail_Script\s=\s\[\sBEGIN\s*(\S*(?:\s++(?!END\s])\S*)*)\s+END\s]
See the regex demo
The part I modified is \S*(?:\s++(?!END\s])\S*)*:
\S* - 0 or more non-whitespace characters
(?:\s++(?!END\s])\S*)* - 0 or more occurrences of
\s++(?!END\s]) - 1+ whitespace characters (matched possessively, so that the lookahead check is performed only once, after all the whitespace characters are matched) not followed by END, 1 whitespace and a ] char
\S* - 0 or more non-whitespace characters
Why not a mere BodyOftheMail_Script\s=\s\[\sBEGIN\s*(.*?)\s+END\s] with re.DOTALL? The \s*(.*?)\s+END\s] will work as follows: 0+ whitespaces will be matched at once, then (.*?) will be skipped the first time, then \s+END\s] pattern will be tried. If \s+END\s] is not matched, .*? will grab one char and again let the subsequent patterns try to match the string. And so on. It might take a lot of backtracking steps to reach the end of a match (if it is there, else, it might end in a timeout sooner than later).
Performance comparison
Since the number of steps at regex101.com is not direct proof that a certain pattern is more efficient than another, I decided to run performance tests using the Python PyPi regex library. See the code below.
The results below were obtained on a PC with 16GB RAM and an Intel Core i5-9400F CPU; consistent results were obtained with PyPi regex versions 2.5.77 and 2.5.82:
┌──────────┬─────────────────────────────────────────────────────────────────┐
│ Regex │ Time taken │
├──────────┼─────────────────────────────────────────────────────────────────┤
│ OP 1 │ 0.5606743000000001 │
│ OP 2 │ 0.5524994999999999 │
│ OP 3 │ 0.5026944 │
│ OP 4 │ 0.7502984000000001 │
│ WS_1 │ 0.25729479999999993 │
│ WS_2 │ 0.3680949 │
└──────────┴─────────────────────────────────────────────────────────────────┘
Conclusions:
The worst OP regex is the one that contains the notorious (.|\n)*? pattern. It is one of the most inefficient patterns I have seen in my regex life, and it causes issues across all languages. Please never use it in your patterns.
The first three OP patterns are comparable, but it is clear that the common workarounds for making . match any char, [\w\W] and [\s\S], should be avoided if there is a way to make . match any char with a modifier, such as (?s) or regex.DOTALL. The (?s). native solution is a tiny bit more efficient.
My suggestion appears to be twice as fast compared to the best OP pattern because it matches strings from the left-hand delimiter to the right-hand delimiter in chunks, only stopping to check for the right-hand delimiter after grabbing non-whitespace chunks of text and the whitespaces that follow them.
The .*? construct expands each time a char is not the start of the right-hand delimiter, so its efficiency decreases as the strings get longer.
The Python testing code:
import regex, timeit
text = 'BodyOftheMail_Script = [ BEGIN some text\nhere and\nhere, too \nEND ]'
regex_pattern_1=regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s)[\s\S]*?)(?=\s{1,}END\s])')
regex_pattern_2=regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s)[\w\W]*?)(?=\s{1,}END\s])')
regex_pattern_3=regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s).*?)(?=\s{1,}END\s])')
regex_pattern_4=regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((.|\n)*?)(?=\s{1,}END\s\])')
regex_pattern_WS_1=regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s*(\S*(?:\s++(?!END\s])\S*)*)\s+END\s]')
regexp_patternWS_2 = regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s*(.*?)\s+END\s]', regex.DOTALL)
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern_1 as p', number=100000))
# => 0.5606743000000001
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern_2 as p', number=100000))
# => 0.5524994999999999
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern_3 as p', number=100000))
# => 0.5026944
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern_4 as p', number=100000))
# => 0.7502984000000001
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern_WS_1 as p', number=100000))
# => 0.25729479999999993
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regexp_patternWS_2 as p', number=100000))
# => 0.3680949
Unless you missed some important details in your question, I don't see any reason to overcomplicate things. Why not use a simple BodyOftheMail_Script = \[ BEGIN.*?END \]? You have your start indicator BodyOftheMail_Script = [ BEGIN, you have your end indicator END ], and you want to match everything in between in a non-greedy way with .*?. In Python this requires the re.DOTALL flag so that . also matches newlines (re.MULTILINE only changes how ^ and $ behave, so it has no effect on this pattern, although passing it is harmless):
import re
regexp = re.compile(r'BodyOftheMail_Script = \[ BEGIN.*?END \]', re.DOTALL | re.MULTILINE)
The first rule of regexps - do not overcomplicate ;) Someone will read it after you.
Using the same comparison script as in @Wictor's answer, I got the following results:
OP 1 0.24152620000000002
OP 2 0.28501820000000005
OP 3 0.20582650000000002
OP 4 0.3379188999999999
WS 0.16937669999999994
Subj 0.10387990000000014
Replacing the literal spaces with \s is possible and it does not really change the speed (but if you only have spaces in the actual file, then just use a space; do not overcomplicate).
Also if you want, you can add the group to directly get the content, it adds ~0.02s for me, most probably it will be faster to trim each result afterwards instead of using regexp group.
There is a problem that I need to do, but there are some caveats that make it hard.
Problem: Match on all non-empty strings over the alphabet {abc} that contain at most one a.
Examples
a
abc
bbca
bbcabb
Nonexample
aa
bbaa
Caveats: You cannot use a lookahead/lookbehind.
What I have is this:
^[bc]*a?[bc]*$
but it matches empty strings. Maybe a hint? Idk anything would help
(And if it matters, I'm using python).
As I understand your question, the only problem is that your current pattern matches empty strings. To prevent this, you can use a word boundary \b to require at least one word character.
^\b[bc]*a?[bc]*$
See demo at regex101
Another option would be to alternate in a group. Match an a surrounded by any amount of [bc] or one or more [bc] from start to end which could look like: ^(?:[bc]*a[bc]*|[bc]+)$
The way I understood the issue was that any character in the alphabet should match, just only one a character.
Match on all non-empty strings over the alphabet... at most one a
^[b-z]*a?[b-z]*$
If spaces can be included:
^([b-z]*\s?)*a?([b-z]*\s?)*$
You do not even need a regex here, you might as well use .count() and a list comprehension:
data = """a,abc,bbca,bbcabb,aa,bbaa,something without the bespoken letter,ooo"""
def filter(string, char):
    return [word
            for word in string.split(",")
            for c in [word.count(char)]
            if c in [0,1]]
print(filter(data, 'a'))
Yielding
['a', 'abc', 'bbca', 'bbcabb', 'something without the bespoken letter', 'ooo']
You've got to positively match something excluding the empty string,
using only a, b, or c letters. But can't use assertions.
Here is what you do.
The regex ^(?:[bc]*a[bc]*|[bc]+)$
The explanation
 ^                      # BOS
 (?:                    # Cluster choice
      [bc]* a [bc]*     # only 1 [a] allowed, arbitrary [bc]'s
   |                    # or,
      [bc]+             # no [a]'s only [bc]'s ( so must be some )
 )                      # End cluster
 $                      # EOS
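Since the question mentions Python, here is a quick sketch for checking the alternation against the examples from the question, alongside the word-boundary variant suggested earlier in this thread. The names ALTERNATION, BOUNDARY and accepts are made up for this illustration, and fullmatch stands in for the ^ and $ anchors.

```python
import re

# Either exactly one a surrounded by b/c, or a non-empty run of b/c.
ALTERNATION = re.compile(r"(?:[bc]*a[bc]*|[bc]+)")
# Word-boundary variant: \b cannot match inside an empty string.
BOUNDARY = re.compile(r"\b[bc]*a?[bc]*")

def accepts(pattern, s):
    """True when the whole string s matches pattern."""
    return pattern.fullmatch(s) is not None

examples = ["a", "abc", "bbca", "bbcabb"]
nonexamples = ["aa", "bbaa", ""]
for pattern in (ALTERNATION, BOUNDARY):
    print([accepts(pattern, s) for s in examples + nonexamples])
```

Neither pattern uses a lookahead or lookbehind, so both stay within the stated caveat.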
I am trying to parse a csv file, and I am trying to access a named regex in a proto regex in Perl6. It turns out to be Nil. What is the proper way to do it?
grammar rsCSV {
    regex TOP { ( \s* <oneCSV> \s* \, \s* )* }
    proto regex oneCSV {*}
    regex oneCSV:sym<noQuote> { <-[\"]>*? }
    regex oneCSV:sym<quoted> { \" .*? \" } # use non-greedy match
}
my $input = prompt("Enter csv line: ");
my $m1 = rsCSV.parse($input);
say "===========================";
say $m1;
say "===========================";
say "1 " ~ $m1<oneCSV><quoted>; # this fails; it is "Nil"
say "2 " ~ $m1[0];
say "3 " ~ $m1[0][2];
Detailed discussion complementing Christoph's answer
I am trying to parse a csv file
Perhaps you are focused on learning Raku parsing and are writing some throwaway code. But if you want industrial strength CSV parsing out of the box, please be aware of the Text::CSV modules[1].
I am trying to access a named regex
If you are learning Raku parsing, please take advantage of the awesome related (free) developer tools[2].
in proto regex in Raku
Your issue is unrelated to it being a proto regex.
Instead the issue is that, while the match object corresponding to your named capture is stored in the overall match object you stored in $m1, it is not stored precisely where you are looking for it.
Where do match objects corresponding to captures appear?
To see what's going on, I'll start by simulating what you were trying to do. I'll use a regex that declares just one capture, a "named" (aka "Associative") capture that matches the string ab.
given 'ab'
{
    my $m1 = m/ $<named-capture> = ( ab ) /;
    say $m1<named-capture>;
    # 「ab」
}
The match object corresponding to the named capture is stored where you'd presumably expect it to appear within $m1, at $m1<named-capture>.
But you were getting Nil with $m1<oneCSV>. What gives?
Why your $m1<oneCSV> did not work
There are two types of capture: named (aka "Associative") and numbered (aka "Positional"). The parens you wrote in your regex that surrounded <oneCSV> introduced a numbered capture:
given 'ab'
{
    my $m1 = m/ ( $<named-capture> = ( ab ) ) /; # extra parens added
    say $m1[0]<named-capture>;
    # 「ab」
}
The parens in / ( ... ) / declare a single top level numbered capture. If it matches, then the corresponding match object is stored in $m1[0]. (If your regex looked like / ... ( ... ) ... ( ... ) ... ( ... ) ... / then another match object corresponding to what matches the second pair of parentheses would be stored in $m1[1], another in $m1[2] for the third, and so on.)
The match result for $<named-capture> = ( ab ) is then stored inside $m1[0]. That's why say $m1[0]<named-capture> works.
So far so good. But this is only half the story...
Why $m1[0]<oneCSV> in your code would not work either
While $m1[0]<named-capture> in the immediately above code is working, you would still not get a match object in $m1[0]<oneCSV> in your original code. This is because you also asked for multiple matches of the zeroth capture because you used a * quantifier:
given 'ab'
{
    my $m1 = m/ ( $<named-capture> = ( ab ) )* /; # * is a quantifier
    say $m1[0][0]<named-capture>;
    # 「ab」
}
Because the * quantifier asks for multiple matches, Raku writes a list of match objects into $m1[0]. (In this case there's only one such match so you end up with a list of length 1, i.e. just $m1[0][0] (and not $m1[0][1], $m1[0][2], etc.).)
Summary
Captures nest;
A capture quantified by either * or + corresponds to two levels of nesting not just one.
In your original code, you'd have to write say $m1[0][0]<oneCSV>; to get to the match object you're looking for.
[1] Install relevant modules and write use Text::CSV; (for a pure Raku implementation) or use Text::CSV:from<Perl5>; (for a Perl plus XS implementation) at the start of your code. (talk slides (click on top word, eg. "csv", to advance through slides), video, Raku module, Perl XS module.)
[2] Install CommaIDE and have fun with its awesome grammar/regex development/debugging/analysis features. Or install the Grammar::Tracer; and/or Grammar::Debugger modules and write use Grammar::Tracer; or use Grammar::Debugger; at the start of your code (talk slides, video, modules.)
The match for <oneCSV> lives within the scope of the capture group, which you get via $m1[0].
As the group is quantified with *, the results will again be a list, ie you need another indexing operation to get at a match object, eg $m1[0][0] for the first one.
The named capture can then be accessed by name, eg $m1[0][0]<oneCSV>. This will already contain the match result of the appropriate branch of the protoregex.
If you want the whole list of matches instead of a specific one, you can use >> or map, eg $m1[0]>>.<oneCSV>.
I would like to write this regex (later simplified form) in a more compact, elegant, and systematic way. PCRE or Python (newer engine) preferred. In short, I would like to capture each artery name (iliac, femoral, popliteal and so on), regardless of the string between them. Ideally, the resulting regex won't depend on any particular regex flavor.
LE2: Even more simplified regex, but not working correctly: https://www.regex101.com/r/cK5wB6/7. I've eliminated the DEFINE section; it was added only for modularity purposes, and DEFINE is not compatible with Python anyway (the newer v1 engine added this feature). I want to capture all the artery names, the equivalent of getting a vector of all artery names, regardless of the number of names or the strings between them.
(arteries:.{0,25}?)
((?<art>iliac|femoral|popliteal|peroneal|tibial).*?)*
(?<artfinal>(?&art))
The problem is that some arteries are still not recognized correctly (at least visually). I'm trying to capture those names without explicitly writing capturing groups like in this.
LE4: The last variant actually ignores all names aside from the 1st and the last two.
First, a regex-flavour-independent pattern is a myth. Regex engines are different, have different features, and even the same pattern that only uses tokens common to two or more regex engines can return different results.
An example with the Python regex module that has interesting features like the ability to use a set (\L<arteries> in the pattern) and the ability to store repeated capture groups:
import regex
s = '''arteries: jhjh iliac jdfd femoral
arteries: sdsdsd iliac jdfd femoral fd d popliteal
arteries: hgv popliteal,sddsdsds iliac tibial nkjkknperoneal nkjkkn
arteries: iliac, peroneal jm tibia nktibial nkjkkn
arteries: m bkjkjnperoneal vc peroneal fdfd femoral n tibial jnmmmmm tibial jnnjnjmbn n iliacbjk
arteries:m bkjnkjnperoneal mm femoral jnnbn n right femoralbjkkbb jk'''
arteries_set = ['femoral', 'iliac', 'peroneal', 'tibial']
p = regex.compile(r'^arteries: (?: [^\w\n]* (?>\w+[^\w\n]+)*? (\L<arteries>) \M)+', regex.M | regex.I | regex.X, arteries=arteries_set)
for m in p.finditer(s):
    print(m.captures(1))
I deliberately removed the "less than 25 characters" condition to build a more efficient pattern, but feel free to replace [^\w\n]* (?>\w+[^\w\n]+)*? with .{0,25}? \m
(\m and \M are word boundaries, respectively for the start and the end of a word)
I want capture all the artery names,
The problem is that some arteries are still not recognized correctly (at least visually)
The problem with this regex:
((?<art>iliac|femoral|popliteal|peroneal|tibial).*?)*
is that the group art continuously overwrites its capture with the last match. This is an expected behaviour by design.
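The overwriting behaviour is easy to reproduce with Python's standard re module. The sketch below is only an illustration under assumed names (NAMES, line, repeated): a capture group inside a repeated group keeps just its last value, whereas collecting one match at a time keeps them all.

```python
import re

NAMES = r"(?:iliac|femoral|popliteal|peroneal|tibial)"
line = "arteries: sdsdsd iliac jdfd femoral fd d popliteal"

# One capture group inside a repeated group: each iteration overwrites group 1.
repeated = re.match(r"arteries:(?:.*?\b(" + NAMES + r")\b)*", line)
last_only = repeated.group(1)

# Matching one name at a time keeps every occurrence.
all_names = re.findall(r"\b" + NAMES + r"\b", line)

print(last_only)
print(all_names)
```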
I want to recognize all the arteries given in the list (see definition
of <art>). The same artery could appear 1...n times and in any position.
The strings between artery names could be anything of max. 25 chars.
As of flavors, let's stick with PCRE
Provided you're working with PCRE, instead of matching all occurrences of arteries at once, I would suggest matching one artery at a time. To achieve that, we can use \G to match at the end of the last match.
Regex:
/\G                     # Match anchor (BoS or EoLastMatch)
(?:
    (?!^)               # With previous match
  |
    .*?                 # Or first occurrence
    arteries:           # of arteries:
)
.{1,25}?                # Separated by max 25 chars
(?P<art>                # Group 1 (capture 1 artery)
    \b                  # List of arteries
    (?:iliac|femoral|popliteal|peroneal|tibial)
    \b                  # in between word boundaries
# Modif: global, caseless, singleline, extra
)/gixs
This will capture each artery in group art (group 1).
Notes about other flavours:
As for compatibility with other regex flavours, you could loop over each match in your code to simulate \G (which some flavours, such as JavaScript and Python's standard re module, do not implement). Another option is to split the text with the expression:
(arteries:|\b(?:iliac|femoral|popliteal|peroneal|tibial)\b)
and then check the length of each token to guarantee there are no more than 25 chars in between.
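The split-and-check idea could be sketched in Python roughly as follows. This is only an approximation of the PCRE pattern under some assumptions: the function name arteries_by_split and the max_gap parameter are hypothetical, and edge cases (such as an artery name glued directly to "arteries:") may behave differently from the \G version.

```python
import re

ARTERIES = ("iliac", "femoral", "popliteal", "peroneal", "tibial")
# The capturing group makes re.split keep the delimiters in the result.
SPLITTER = re.compile(r"(arteries:|\b(?:%s)\b)" % "|".join(ARTERIES), re.I)

def arteries_by_split(text, max_gap=25):
    """Return one list of artery names per 'arteries:' marker (hypothetical helper)."""
    tokens = SPLITTER.split(text)
    results = []
    current = None  # None before the first marker, or after a chain is broken
    # tokens alternate: gap, delimiter, gap, delimiter, ..., trailing gap
    for i in range(1, len(tokens), 2):
        gap, delim = tokens[i - 1], tokens[i]
        if delim.lower() == "arteries:":
            current = []
            results.append(current)
        elif current is not None:
            if len(gap) <= max_gap:
                current.append(delim)
            else:
                current = None  # gap too long: stop collecting for this marker
    return results

sample = ("arteries: jhjh iliac jdfd femoral\n"
          "arteries: x tibial " + "z" * 30 + " iliac")
print(arteries_by_split(sample))  # [['iliac', 'femoral'], ['tibial']]
```

In the second line of the sample, the final iliac is dropped because more than 25 characters separate it from tibial.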
the code will have to be migrated to Python one (not so distant) day
Update: Migrating to Python:
You can use \G in Python, since the regex module implements it, but if you do use that module, take advantage of its ability to retrieve repeated captures from a group with the .captures method. Check @CasimiretHippolyte's answer, a perfect example of using captures in this case.
On the other hand, if you stick to the standard re module, I'd recommend looping each match to simulate the same behaviour.
Code:
import re
text = '''arteries: jhjh iliac jdfd femoral
arteries: sdsdsd iliac jdfd femoral fd d popliteal
some arteries: hgv popliteal,sddsdsds iliac tibial nkjkknperoneal nkjkkn
arteries: iliac, peroneal jm tibia nktibial nkjkkn
arteries: m bkjkjnperoneal vc peroneal fdfd femoral n tibial jnmmmmm tibial jnnjnjmbn n iliacbjk
arteries:m bkjnkjnperoneal mm femoral jnnbn n right femoralbjkkbb jk'''
n = 0
pattern_from = re.compile( r'arteries:', re.I)
pattern_token = re.compile( r'.{1,25}?\b(iliac|femoral|popliteal|peroneal|tibial)\b', re.I)
for match_from in pattern_from.finditer(text):
    n = n + 1
    print( '\nMatch #%s:' % n, end="")
    match_token = pattern_token.match( text, match_from.end())
    while match_token:
        print( '[%s:%s]="%s" ' % (match_token.start(1), match_token.end(1), match_token.group(1)), end="")
        match_token = pattern_token.match( text, match_token.end())
Output:
Match #1:[15:20]="iliac" [26:33]="femoral"
Match #2:[52:57]="iliac" [63:70]="femoral" [76:85]="popliteal"
Match #3:[106:115]="popliteal" [125:130]="iliac" [132:138]="tibial"
Match #4:[171:176]="iliac" [179:187]="peroneal"
Match #5:[243:251]="peroneal" [257:264]="femoral" [267:273]="tibial" [282:288]="tibial"
Match #6:[344:351]="femoral"
Given test <- c('met','meet','eel','elm'), I need a single line of code that matches any 'e' that is not in 'me' or 'ee'. I wrote (ee|me)(*SKIP)(*F)|e, which does exclude 'met' and 'eel', but not 'meet'. Is this because | is exclusive or? At any rate, is there a solution that just returns 'elm'?
For the record, I know I can also do (?<![me])e(?!e), but I would like to know what the solution is for (*SKIP)(*F) and why my line is wrong.
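For reference, the lookaround variant mentioned here does not depend on backtracking verbs, so it can be checked with Python's standard re module. A minimal sketch, with "fi" as an arbitrary replacement string:

```python
import re

# An e that is not preceded by m or e, and not followed by e.
LOOKAROUND = re.compile(r"(?<![me])e(?!e)")

test = ["met", "meet", "eel", "elm"]
replaced = [LOOKAROUND.sub("fi", word) for word in test]
print(replaced)  # ['met', 'meet', 'eel', 'film']
```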
This is the correct solution with (*SKIP)(*F):
(?:me+|ee+)(*SKIP)(*FAIL)|e
Demo on regex101, using the following test cases:
met
meet
eel
elm
degree
zookeeper
meee
Only the e in elm, the first e in degree and the last e in zookeeper are matched.
Since e in me is forbidden, any e immediately after m is forbidden; and since e in ee is forbidden, every e in a run of two or more consecutive e's is forbidden. This explains the sub-pattern (?:me+|ee+).
While I am aware that this method is not extensible, it is at least logically correct.
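Python's standard re module has no (*SKIP)(*FAIL), but the same logic can be approximated by putting the unwanted alternatives first and capturing only the wanted e. The sketch below is an emulation, not the PCRE pattern itself; the helper name allowed_e_positions is made up for illustration.

```python
import re

# Branches 1 and 2 play the role of (?:me+|ee+)(*SKIP)(*FAIL);
# only branch 3 captures, so a non-None group 1 marks a wanted e.
PATTERN = re.compile(r"me+|ee+|(e)")

def allowed_e_positions(word):
    """Offsets of every e that is not part of 'me' or 'ee' (hypothetical helper)."""
    return [m.start(1) for m in PATTERN.finditer(word) if m.group(1) is not None]

for word in ["met", "meet", "eel", "elm", "degree", "zookeeper", "meee"]:
    print(word, allowed_e_positions(word))
# elm -> [0], degree -> [1], zookeeper -> [7]; every other word -> []
```

Because finditer resumes after each consumed match, the unwanted runs are skipped in much the same way (*SKIP) moves the bump-along position.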
Analysis of other solutions
Solution 0
(ee|me)(*SKIP)(*F)|e
Let's use meet as an example:
meet      # (ee|me)(*SKIP)(*F)|e
^         # ^

meet      # (ee|me)(*SKIP)(*F)|e
  ^       #        ^

meet      # (ee|me)(*SKIP)(*F)|e
  ^       #               ^
          # Forbid backtracking to pattern to the left
          # Set index of bump along advance to current position

meet      # (ee|me)(*SKIP)(*F)|e
  ^       #                   ^
          # Pattern failed. No choice left. Bump along.
          # Note that backtracking to before (*SKIP) is forbidden,
          # so e in second branch is not tried

meet      # (ee|me)(*SKIP)(*F)|e
  ^       # ^
          # Can't match ee or me. Try the other branch

meet      # (ee|me)(*SKIP)(*F)|e
   ^      #                     ^
          # Found a match `e`
The problem is due to the fact that me consumes the first e, so ee fails to match, leaving the second e available for matching.
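The same effect can be reproduced with Python's standard re module by emulating the skipped branches with a plain alternation (an approximation, since re has no backtracking verbs). The helper name unskipped_e is hypothetical.

```python
import re

def unskipped_e(skip_branches, word):
    """Offsets of each e matched by the final capturing branch (hypothetical helper)."""
    pattern = re.compile(skip_branches + r"|(e)")
    return [m.start(1) for m in pattern.finditer(word) if m.group(1) is not None]

print(unskipped_e(r"ee|me", "meet"))    # [2]  me consumes the first e, the second slips through
print(unskipped_e(r"me+|ee+", "meet"))  # []   the quantifier swallows the whole run of e
```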
Solution 1
\w*(ee|me)\w*(*SKIP)(*FAIL)|e
This just skips every word containing ee or me, which means it fails to match anything in degree and zookeeper.
Solution 2
(?:ee|mee?)(*SKIP)(?!)|e
Same problem as in solution 0. When m is followed by three e's in a row, the first two are consumed by mee?, leaving the third e available for matching.
Solution 3
(?:^.*[me]e)(*SKIP)(*FAIL)|e
This throws away the input up to the last me or ee, which means that any valid e before the last me or ee is not matched, like the first e in degree.
You need a preceding/following boundary forcing the regex engine to not retry the substring.
gsub('\\w*[em]e\\w*(*SKIP)(?!)|e', '', test, perl=T)
Or, as @CasimiretHippolyte pointed out, let me be followed by an optional "e":
gsub('(?:ee|mee?)(*SKIP)(?!)|e', '', test, perl=T)
Update, per the comments: use a quantifier to cover other cases:
gsub('[em]e+(*SKIP)(?!)|e', '', test, perl=T)
Note: I decided to use (?!) instead of (*F); both force the regex to fail.
(?!) # equivalent to ( (*FAIL) or (*F) - both synonyms for (?!) ),
# causes matching failure, forcing backtracking to occur
Overall, the syntax can be written as (*SKIP)(*FAIL), (*SKIP)(*F) or (*SKIP)(?!)
You can add \w* to your first pattern to give the engine more context, telling it that ee or me can appear at the beginning, middle or end of a string.
You can use a regex like this:
\w*(ee|me)\w*(*SKIP)(*FAIL)|e
In R, the regex would be:
> test <- c('met','meet','eel','elm')
> gsub("\\w*(?:ee|me)\\w*(*SKIP)(*FAIL)|e", "fi", perl=TRUE, test)
[1] "met" "meet" "eel" "film"
OR
> gsub('(?:^.*[me]e)(*SKIP)(*FAIL)|e', 'fi', test, perl=T)
[1] "met" "meet" "eel" "film"