Grammar a bit too greedy in Perl6 - regex

I am having problems with this mini-grammar, which tries to match markdown-like header constructs.
role Like-a-word {
regex like-a-word { \S+ }
}
role Span does Like-a-word {
regex span { <like-a-word>[\s+ <like-a-word>]* }
}
grammar Grammar::Headers does Span {
token TOP {^ <header> \v+ $}
token hashes { '#'**1..6 }
regex header {^^ <hashes> \h+ <span> [\h* $0]? $$}
}
I would like it to match ## Easier ## as a header, but instead it takes ## as part of span:
TOP
| header
| | hashes
| | * MATCH "##"
| | span
| | | like-a-word
| | | * MATCH "Easier"
| | | like-a-word
| | | * MATCH "##"
| | | like-a-word
| | | * FAIL
| | * MATCH "Easier ##"
| * MATCH "## Easier ##"
* MATCH "## Easier ##\n"
「## Easier ##
」
header => 「## Easier ##」
hashes => 「##」
span => 「Easier ##」
like-a-word => 「Easier」
like-a-word => 「##」
The problem is that the [\h* $0]? simply does not seem to work, with span gobbling up all available words. Any idea?

First, as others have pointed out, <hashes> does not capture into $0, but instead, it captures into $<hashes>, so you have to write:
regex header {^^ <hashes> \h+ <span> [\h* $<hashes>]? $$}
But that still doesn't match the way you want, because span greedily consumes the trailing ## as a word, after which the [\h* $<hashes>]? part happily matches zero occurrences.
The proper fix is to not let span match ## as a word:
role Like-a-word {
regex like-a-word { <!before '#'> \S+ }
}
role Span does Like-a-word {
regex span { <like-a-word>[\s+ <like-a-word>]* }
}
grammar Grammar::Headers does Span {
token TOP {^ <header> \v+ $}
token hashes { '#'**1..6 }
regex header {^^ <hashes> \h+ <span> [\h* $<hashes>]? $$}
}
say Grammar::Headers.subparse("## Easier ##\n", :rule<header>);
If you are loath to modify like-a-word, you can also force the exclusion of a final # from it like this:
role Like-a-word {
regex like-a-word { \S+ }
}
role Span does Like-a-word {
regex span { <like-a-word>[\s+ <like-a-word>]* }
}
grammar Grammar::Headers does Span {
token TOP {^ <header> \v+ $}
token hashes { '#'**1..6 }
regex header {^^ <hashes> \h+ <span> <!after '#'> [\h* $<hashes>]? $$}
}
say Grammar::Headers.subparse("## Easier ##\n", :rule<header>);
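For anyone who wants to poke at the greedy behaviour outside Raku, here is a rough JavaScript analogue. It is only a sketch, not a translation of the grammar: it assumes [ \t] as a stand-in for \h and a numbered group with \1 in place of $<hashes>, and the variable names are made up for illustration.

```javascript
// Greedy version: the span (group 2) is free to swallow the closing hashes,
// and the optional closing group then matches zero times.
const greedy = /^(#{1,6})[ \t]+(\S+(?:\s+\S+)*)(?:[ \t]*\1)?$/;

// Lookahead fix: no word inside the span may start with '#'.
const noHashWords = /^(#{1,6})[ \t]+((?!#)\S+(?:\s+(?!#)\S+)*)(?:[ \t]*\1)?$/;

// Lookbehind fix: the span as a whole may not end in '#'.
const noTrailingHash = /^(#{1,6})[ \t]+(\S+(?:\s+\S+)*)(?<!#)(?:[ \t]*\1)?$/;

const line = "## Easier ##";
const spanGreedy = greedy.exec(line)[2];
const spanLookahead = noHashWords.exec(line)[2];
const spanLookbehind = noTrailingHash.exec(line)[2];

console.log(spanGreedy);     // Easier ##
console.log(spanLookahead);  // Easier
console.log(spanLookbehind); // Easier
```

The two fixed patterns differ in what they reject: the lookahead version refuses any word that starts with #, while the lookbehind version only refuses a span that ends in #.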

Just change
regex header {^^ <hashes> \h+ <span> [\h* $0]? $$}
to
regex header {^^ (<hashes>) \h+ <span> [\h* $0]? $$}
So that the capture works. Thanks to Eugene Barsky for calling this.

I played with this a bit because I thought there were two interesting things you might do.
First, you can make hashes take an argument saying how many hash marks it should match. That way you can do special things based on the level if you like, and you can reuse hashes in different parts of the grammar where you require different but exact numbers of hash marks.
Next, the ~ stitcher allows you to specify that something will show up in the middle of two things so you can put those wrapper things next to each other. For example, to match (Foo) you could write '(' ~ ')' Foo. With that it looks like I came up with the same thing you posted:
use Grammar::Tracer;
role Like-a-word {
regex like-a-word { \S+ }
}
role Span does Like-a-word {
regex span { <like-a-word>[\s+ <like-a-word>]* }
}
grammar Grammar::Headers does Span {
token TOP {^ <header> \v+ $}
token hashes ( $n = 1 ) { '#' ** {$n} }
regex header { [(<hashes(2)>) \h*] ~ [\h* $0] <span> }
}
my $result = Grammar::Headers.parse( "## Easier ##\n" );
say $result;
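The same "exact number of hashes" idea can be sketched outside Raku as well. JavaScript has neither parameterised tokens nor the ~ stitcher, so the hypothetical helper headerRegex below simply builds one pattern per level; the helper name and the use of [ \t] for \h are assumptions made for this illustration.

```javascript
// Hypothetical helper: build a pattern for a header with exactly `level`
// hash marks on both sides of the title.
function headerRegex(level) {
  const hashes = `#{${level}}`;
  // (?!#) rejects a longer run of opening hashes; (.+?) is the title.
  return new RegExp(`^${hashes}(?!#)[ \\t]*(.+?)[ \\t]*${hashes}$`);
}

const level2 = headerRegex(2).exec("## Easier ##");
const wrongLevel = headerRegex(3).exec("## Easier ##");
const tooManyHashes = headerRegex(2).exec("### Deep ###");

console.log(level2[1]);     // Easier
console.log(wrongLevel);    // null
console.log(tooManyHashes); // null
```

As in the grammar above, the closing hashes are required here rather than optional.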

Related

What's wrong with this cucumber regex?

I tried to define this custom parameter type:
ParameterType("complex", """(?x)
| complex\( \s*
| (?:
| ($realRegex)
| (?:
| \s* ([+-]) \s* ($realRegex) \s* i
| )?
| |
| ($realRegex) \s* i
| )
| \s* \)
|""".trimMargin()) {
realPartString: String?, joiner: String?, imaginaryPartString1: String?, imaginaryPartString2: String? ->
// handling logic omitted as it doesn't matter yet
}
Cucumber is giving an error:
java.util.NoSuchElementException
at java.base/java.util.Spliterators$2Adapter.nextInt(Spliterators.java:733)
at java.base/java.util.PrimitiveIterator$OfInt.next(PrimitiveIterator.java:128)
at java.base/java.util.PrimitiveIterator$OfInt.next(PrimitiveIterator.java:86)
at io.cucumber.cucumberexpressions.GroupBuilder.build(GroupBuilder.java:18)
at io.cucumber.cucumberexpressions.GroupBuilder.build(GroupBuilder.java:21)
at io.cucumber.cucumberexpressions.GroupBuilder.build(GroupBuilder.java:21)
at io.cucumber.cucumberexpressions.TreeRegexp.match(TreeRegexp.java:78)
at io.cucumber.cucumberexpressions.Argument.build(Argument.java:15)
at io.cucumber.cucumberexpressions.CucumberExpression.match(CucumberExpression.java:144)
at io.cucumber.core.stepexpression.StepExpression.match(StepExpression.java:22)
at io.cucumber.core.stepexpression.ArgumentMatcher.argumentsFrom(ArgumentMatcher.java:30)
It sounds as if I have missed a closing group but I've been staring at it for hours and can't see what I've done wrong yet.
I thought maybe throwing it to StackOverflow would immediately get someone to spot the issue. LOL
In case it matters, the definition of $realRegex is as follows:
private const val doubleTermRegex = "\\d+(?:\\.\\d+)?"
private const val realTermRegex = "(?:-?(?:√?(?:$doubleTermRegex|π)|∞))"
const val realRegex = "(?:$realTermRegex(?:\\s*\\/\\s*$realTermRegex)?)"
That code has been around for much longer and is exercised by many tests, so I'm guessing the issue is in the new code, but I guess you never know.
Versions in use:
Cucumber-Java8 5.7.0
Kotlin-JVM 1.5.21
Java 11
In any case, here's the whole file for context.
package garden.ephemeral.rocket.util
import garden.ephemeral.rocket.util.RealParser.Companion.realFromString
import garden.ephemeral.rocket.util.RealParser.Companion.realRegex
import io.cucumber.java8.En
class ComplexStepDefinitions: En {
lateinit var z1: Complex
lateinit var z2: Complex
init {
ParameterType("complex", """(?x)
| complex\( \s*
| (?:
| ($realRegex)
| (?:
| \s* ([+-]) \s* ($realRegex) \s* i
| )?
| |
| ($realRegex) \s* i
| )
| \s* \)
|""".trimMargin()) {
realPartString: String?,
joiner: String?,
imaginaryPartString1: String?,
imaginaryPartString2: String? ->
if (realPartString != null) {
if (imaginaryPartString1 != null) {
val imaginarySign = if (joiner == "-") -1.0 else 1.0
Complex(realFromString(realPartString), imaginarySign * realFromString(imaginaryPartString1))
} else {
Complex(realFromString(realPartString), 0.0)
}
} else if (imaginaryPartString2 != null) {
Complex(0.0, realFromString(imaginaryPartString2))
} else {
throw AssertionError("It shouldn't have matched the regex")
}
}
}
}
You've used (?x) as a flag expression to enable whitespace and comments in the pattern, which is probably a good idea given the size of the monster.
When using Cucumber Expressions, Cucumber parses your regex itself, and that parser did not include support for flag expressions, which is what you are running into now. Updating Cucumber to the latest version should fix that problem. You are welcome.
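If upgrading is not an option, one possible workaround (my assumption, not something Cucumber provides) is to keep the readable layout in the source but strip the whitespace yourself, so that no (?x) flag is needed. The sketch below uses JavaScript, which has no free-spacing flag either; compactPattern is a hypothetical helper, and the simplified real pattern is only a stand-in for the real realRegex.

```javascript
// Hypothetical helper: drop every unescaped whitespace character from a
// verbose pattern. Escaped pairs such as "\(" or "\ " are kept as they are.
// Comments (# ...) are NOT handled in this sketch.
function compactPattern(verbose) {
  let out = "";
  for (let i = 0; i < verbose.length; i++) {
    const ch = verbose[i];
    if (ch === "\\" && i + 1 < verbose.length) {
      out += ch + verbose[i + 1];
      i++;
    } else if (!/\s/.test(ch)) {
      out += ch;
    }
  }
  return out;
}

const real = String.raw`\d+(?:\.\d+)?`; // simplified stand-in for realRegex
const verbose = String.raw`
  complex\( \s*
  (?:
    (${real})
    (?: \s* ([+-]) \s* (${real}) \s* i )?
  |
    (${real}) \s* i
  )
  \s* \)
`;

const complexRe = new RegExp("^" + compactPattern(verbose) + "$");
const both = complexRe.exec("complex(1 + 2i)").slice(1);
const imaginaryOnly = complexRe.exec("complex(3i)").slice(1);

console.log(both);          // [ '1', '+', '2', undefined ]
console.log(imaginaryOnly); // [ undefined, undefined, undefined, '3' ]
```

The four capture groups line up with the four lambda parameters in the question (real part, joiner, and the two imaginary-part variants).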

Match regex until the first occurrence of special character '->'

I am new to regular expressions and am trying to match an expression only up to a special character sequence. If a match exists before the special sequence, it should be returned; otherwise nothing should be returned.
Here is the demo.
My goal is to return the match if found before the '->' special character otherwise return nothing. It should not return the matches after the '->' special char.
Regexp: /()()(\[[^\]]+\])\s*(-[->])(.*)/g // In third group actual result will be returned
For example data:
[AAA] -> [BBB] -> [CCC] // In this case needs to match [AAA]
AAA -> [BBB] -> [CCC] // In this case do not return [BBB]; return nothing, because there is no match before the special sequence '->'.
Please help me with this. Thanks in advance.
Use this regex
^\[(.*?)\] ->
and read capture group 1 (the text inside the square brackets).
See this regex101 test
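A minimal sketch of how that pattern might be used from JavaScript (the variable names are only for illustration). Note that group 1 holds the text without the brackets.

```javascript
const bracketBeforeArrow = /^\[(.*?)\] ->/;

const hit = bracketBeforeArrow.exec("[AAA] -> [BBB] -> [CCC]");
const miss = bracketBeforeArrow.exec("AAA -> [BBB] -> [CCC]");

const hitText = hit ? hit[1] : null;    // text inside the first brackets
const missText = miss ? miss[1] : null; // null: the line does not start with [...]

console.log(hitText);  // AAA
console.log(missText); // null
```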
Is this what you want?
^\[[^\]]+\](?=\h*->)
Explanation:
^ # beginning of string
\[ # opening square bracket
[^\]]+ # 1 or more of any character that is not a closing bracket
\] # closing bracket
(?= # start positive lookahead, make sure we have after:
\h* # 0 or more horizontal spaces
-> # literally ->
) # end lookahead
Demo
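One caveat if this is run in JavaScript, as the /.../g syntax in the question suggests: JavaScript's regex engine has no \h, so the sketch below substitutes [ \t] for it. That substitution and the helper name firstBracketBeforeArrow are my own assumptions.

```javascript
// Same idea as the pattern above, with [ \t] standing in for \h.
const leadingBracket = /^\[[^\]]+\](?=[ \t]*->)/;

// Hypothetical helper: return the leading [..] group, or null if the line
// does not start with one that is followed by ->.
function firstBracketBeforeArrow(line) {
  const m = leadingBracket.exec(line);
  return m ? m[0] : null;
}

console.log(firstBracketBeforeArrow("[AAA] -> [BBB] -> [CCC]")); // [AAA]
console.log(firstBracketBeforeArrow("AAA -> [BBB] -> [CCC]"));   // null
```

Because the arrow sits inside a lookahead, the whole match is just [AAA], brackets included, so no capture group is needed.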
I don't fully understand the problem, so here are a few guesses that might get you closer to what you're trying to accomplish:
^()()((\[[^\]]*\]))\s*->(.*)
Demo 1
()()(\[[^\]]+\])\s*->(\s*\[[^\]]+\]\s*->\s*\[[^\]]+\])
Demo 2
()()(\[[^\]\r\n]+\])\s*->\s*(\[[^\]\r\n]+\]\s*->\s*\[[^\]\r\n]+\])
Demo 3
const regex = /^()()((\[[^\]]*\]))\s*(->)(.*)/gm;
const str = `[AAA] -> [BBB] -> [CCC]
AAA -> [BBB] -> [CCC]`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}

Regex doesn't fetch the nested curly braces

My regex matches nested curly braces in some cases but not in others.
My Code:
use strict;
use warnings;
my $str1 = '$$\eqalign{&\cases{\mathdot{\bf x}=A{\bf x}+Bu\cr y=H{\bf x}}\quad{\rm with}\{\bf x}=\left(\matrix{x\cr\mathdot{x}\cr\theta\cr\mathdot{\theta}}\right),\cr&A\!=\!\!\left(\matrix{0&1&0&0\cr 0&0&-{m_{a}\over M}g&0\cr 0&0&0&1\cr 0&0&{(M\!+\!m_{a})\over Ml}g&0}\right)\!,\ B\!=\!\left(\matrix{0\cr{a\over M}\cr 0\cr-{a\over Ml}}\right)\!,\ H^{T}\!=\!\left(\matrix{1\cr 0\cr 1\cr 0}\right)\!.}$$';
my $str2 = "\\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}";
my $regex = qr/(?:[^{}]*(?:{(?:[^{}]*(?:{(?:[^{}]*(?:{[^{}]*})*[^{}]*)})*[^{}]*)*})*[^{}]*)*/;
if($str1=~m/\{$regex\}/) { print "str1: $&\n"; }
if($str2=~m/\{$regex\}/) { print "str2: $&\n"; }
OUTPUT:
str1: {&\cases{\mathdot{\bf x}=A{\bf x}+Bu\cr y=H{\bf x}}\quad{\rm with}\ {\bf x}=\left(\matrix{x\cr\mathdot{x}\cr\theta\cr\mathdot{\theta}}\right),\cr&A\!=\!\!\left(\matrix{0&1&0&0\cr 0&0&-{m_{a}\over M}g&0\cr 0&0&0&1\cr 0&0&{(M\!+ !m_{a})\over Ml}g&0}\right)\!,\ B\!=\!\left(\matrix{0\cr{a\over M}\cr 0\cr-{a\over Ml}}\right)\!,\ H^{T}\!=\!\left(\matrix{1\cr 0\cr 1\cr 0}\right)\!.}
str2: {2}
The output for str1 is correct; the output for str2 is not.
Expected Output on str2 is:
str2: {{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}
So the first sample, str1, is matched together with its nested curly braces, but the second sample, str2, is not.
My question is how to match the nested curly braces in str2 as well. I am clueless; it would help if someone could point out my mistake.
Thanks in advance.
Since your actual requirements (discussed in the chat) are to match substrings starting with \bib followed with {...} substrings or any chars other than { and }, you should use a regex with a subroutine:
/\\bib(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])*/g
Details:
\\bib - the literal text \bib
(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])* - 0+ occurrences of:
    ({(?:[^{}]++|(?1))*}) - Group 1 (that will be recursed with (?1)) matching
        { - a literal {
        (?:[^{}]++|(?1))* - 0 or more occurrences of 1+ chars other than { and } or the whole Group 1 subpattern
        } - a literal }
    | - or
    (?!\\bib)[^{}] - a char other than { and } not starting a \bib literal char sequence.
See the sample Perl code:
use strict;
use warnings;
use feature 'say';
my $str2 = "\\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}";
while($str2 =~ /\\bib(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])*/g) {
say "$&";
}
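Not every regex engine supports recursion with (?1); JavaScript's, for one, does not. In that situation a plain depth counter can stand in for it. The sketch below swaps in that technique rather than porting the regex above: balancedGroupAt and bibEntryGroups are hypothetical helpers, they only collect the {...} groups that directly follow a \bib... command, and they ignore escaped braces such as \{.

```javascript
// Return the balanced {...} group starting at `start`, or null if the
// character there is not "{" or the braces never balance.
function balancedGroupAt(text, start) {
  if (text[start] !== "{") return null;
  let depth = 0;
  for (let i = start; i < text.length; i++) {
    if (text[i] === "{") depth++;
    else if (text[i] === "}") {
      depth--;
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null;
}

// Collect the consecutive balanced groups after the first \bib... command.
function bibEntryGroups(text) {
  const cmd = /\\bib[a-zA-Z]*/.exec(text);
  if (!cmd) return [];
  const groups = [];
  let pos = cmd.index + cmd[0].length;
  while (text[pos] === "{") {
    const group = balancedGroupAt(text, pos);
    if (group === null) break;
    groups.push(group);
    pos += group.length;
  }
  return groups;
}

const entry = String.raw`\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}}`;
const groups = bibEntryGroups(entry);
console.log(groups[0]); // {Airdetal2013}
console.log(groups[1]); // {{2}{2017}{{{John} {et~al.}}}}
```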
Note: the edit in the question adds \\bibcite{Airdetal2013} in front. However, this doesn't change the analysis below, as it doesn't change the overall nesting levels.
This has got to be possible to do in a better way. There is the recursive regex offered by Wiktor Stribiżew in comments. There are modules for recursive parsing. And there are tools for parsing LaTeX.
However, out of curiosity ...
Your string, shortened suitably
my $str2 = "{{2}{2017}{{{John}{et~al.}}}{{{James}, ... {Gnali}, \& {Giito}}}}";
or, with C standing for a pair of curlies with something inside (no nesting)
"{ C C { { C C } { C, ... \& C } } }"
So you have three levels of nesting, to get down to the last pair {...} (no further nesting).
Your regex, spread out and with $nc = qr/[^{}]*/ (Non-Curlies), so that we can look at it
my $regex = qr/
(?: $nc
(?: {
(?: $nc
(?: {
(?: $nc (?: { $nc } )* $nc )
}
)* $nc
)*
}
)* $nc
)*/x;
I can count two levels here. (The $nc has no curlies so { $nc } matches my C above.)
Thus this regex cannot match that whole string.
How to fix it? The best option is to find another way, so as not to drown in this.
Or, write it out like above, very carefully, and add the missing level.
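If you do go the "write it out carefully" route, one way to avoid hand-counting levels is to generate the pattern. The sketch below is a JavaScript illustration, not the regex from the question: nestedBraces is a hypothetical helper, and its per-level shape (non-curlies, then any number of brace groups each followed by non-curlies) is my own choice.

```javascript
// Hypothetical helper: build a pattern for one {...} group whose contents
// may nest up to `depth` further levels of braces.
function nestedBraces(depth) {
  let inner = "[^{}]*"; // depth 0: no braces inside at all
  for (let i = 0; i < depth; i++) {
    // one more level: non-curlies, then any number of { inner } groups,
    // each followed by more non-curlies
    inner = `[^{}]*(?:\\{${inner}\\}[^{}]*)*`;
  }
  return new RegExp(`\\{${inner}\\}`);
}

const sample = "{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}}}}";
const tooShallow = nestedBraces(2).exec(sample)[0];
const deepEnough = nestedBraces(3).exec(sample)[0];

console.log(tooShallow);            // {2}
console.log(deepEnough === sample); // true
```

With one level too few, the match falls back to the first small group, {2}, which is the same symptom shown in the question.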

Perl6 grammar and action error : "Cannot find method 'ann' on object of type NQPMu"

Okay, I am still having trouble with Perl 6 grammars and actions. I want to find a pattern in a string and, as soon as it is found, change the pattern according to the action and return the modified string.
my $test = "xx, 1-March-23, 23.feb.21, yy foo 12/january/2099 , zzz";
# want this result: xx, 010323, 230221, yy foo 120199 , zzz";
# 2 digits for day, month, year
grammar month {
regex TOP { <unit>+ }
regex unit { <before> <form1> <after> }
regex before { .*? }
regex after { .*? }
regex form1 { \s* <dd> <slash> <mon> <slash> <yy> \s* }
regex slash { \s* <[ \- \/ \. ]> \s* }
regex dd { \d ** 1..2 }
regex yy { (19 | 20)? \d\d }
proto regex mon {*}
regex mon:sym<jan> { \w 'an' \w* }
regex mon:sym<feb> { <sym> }
regex mon:sym<mar> { <[Mm]> 'ar' \w* }
}
class monAct {
method TOP ($/) { make $<unit>.map({.made}); }
method unit ($/) { make $<before> ~ $<form1>.made ~$<after>; }
method form1 ($/) { make $<dd>.made ~ $<mon>.made ~ $<yy>; }
method dd ($/) {
my $ddStr = $/.Str;
if $ddStr.chars == 1 { make "0" ~ $ddStr; } else { make $ddStr; }
}
method mon:sym<jan> ($/) { make "01"; };
method mon:sym<feb> ($/) { make "02"; };
method mon:sym<mar> ($/) { make "03"; };
}
my $m = month.parse($test, actions => monAct.new);
say $m;
say $m.made;
But it says:
===SORRY!===Cannot find method 'ann' on object of type NQPMu
What did I do wrong ? Thank you for your help !!!
This looks like a bug in Rakudo to me, possibly related to before being part of the syntax for lookahead assertions.
It can already be triggered with a simple / <before> /:
$ perl6 --version
This is Rakudo version 2016.11-20-gbd42363 built on MoarVM version 2016.11-10-g0132729
implementing Perl 6.c.
$ perl6 -e '/ <before> /'
===SORRY!===
Cannot find method 'ann' on object of type NQPMu
At the very least, it's a case of a less than awesome error message.
You should report this to rakudobug@perl.org, cf. How to report a bug.

Regular Expression : Splitting a string of list of multivalues

My goal is splitting this string with regular expression:
AA(1.2,1.3)+,BB(125)-,CC(A,B,C)-,DD(QWE)+
in a list of:
AA(1.2,1.3)+
BB(125)-
CC(A,B,C)-
DD(QWE)+
Regards.
This regex works with your sample string:
,(?![^(]+\))
This splits on comma, but uses a negative lookahead to assert that the next bracket character is not a right bracket. It will still split even if there are no following brackets.
Here's some java code demonstrating it working with your sample plus some general input showing its robustness:
String input = "AA(1.2,1.3)+,BB(125)-,FOO,CC(A,B,C)-,DD(QWE)+,BAR";
String[] split = input.split(",(?![^(]+\\))");
for (String s : split) System.out.println(s);
Output:
AA(1.2,1.3)+
BB(125)-
FOO
CC(A,B,C)-
DD(QWE)+
BAR
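To see what the lookahead decides at each comma, here is a small trace. It is written in JavaScript as a sketch, with a shortened input and made-up variable names. It runs the body of the lookahead by hand against the text after every comma.

```javascript
const input = "AA(1.2,1.3)+,BB(125)-,FOO";

// The body of the negative lookahead, anchored at the text after a comma:
// it succeeds when a ")" is reached before any "(".
const closesParenFirst = /^[^(]+\)/;

const trace = [];
for (let i = 0; i < input.length; i++) {
  if (input[i] !== ",") continue;
  const rest = input.slice(i + 1);
  trace.push({ index: i, splitsHere: !closesParenFirst.test(rest) });
}

const parts = input.split(/,(?![^(]+\))/);

console.log(trace);
// [ { index: 6, splitsHere: false },
//   { index: 12, splitsHere: true },
//   { index: 21, splitsHere: true } ]
console.log(parts); // [ 'AA(1.2,1.3)+', 'BB(125)-', 'FOO' ]
```

The comma at index 6 sits inside the parentheses, so the lookahead body matches there and the split is suppressed.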
I don't know what language you are working with, but this does it with grep:
$ grep -o '[A-Z]*([A-Z0-9.,]*)[^,]*' file
AA(1.2,1.3)+
BB(125)-
CC(A,B,C)-
DD(QWE)+
Explanation
[A-Z]*([A-Z0-9.,]*)[^,]*
^^^^^^ ^^^^^^^^^^^ ^^^^^
  |   ^     |     ^  |
  |   |     |     |  everything but a comma
  |   |     |     ) char
  |   |     A-Z 0-9 . or , chars
  |   ( char
list of chars from A to Z
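The same match-instead-of-split approach can be sketched in JavaScript. One difference to watch for: in grep's basic regular expressions an unescaped ( is a literal parenthesis, whereas in JavaScript it opens a group, so the parentheses are escaped below. The variable names are only for illustration.

```javascript
const input = "AA(1.2,1.3)+,BB(125)-,CC(A,B,C)-,DD(QWE)+";

// Same pieces as in the grep pattern: a name, a parenthesised value list,
// then everything up to the next comma.
const item = /[A-Z]*\([A-Z0-9.,]*\)[^,]*/g;
const items = input.match(item) || [];

console.log(items); // [ 'AA(1.2,1.3)+', 'BB(125)-', 'CC(A,B,C)-', 'DD(QWE)+' ]
```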