What's wrong with this Cucumber regex?

I tried to define this custom parameter type:
ParameterType("complex", """(?x)
| complex\( \s*
| (?:
| ($realRegex)
| (?:
| \s* ([+-]) \s* ($realRegex) \s* i
| )?
| |
| ($realRegex) \s* i
| )
| \s* \)
|""".trimMargin()) {
realPartString: String?, joiner: String?, imaginaryPartString1: String?, imaginaryPartString2: String? ->
// handling logic omitted as it doesn't matter yet
}
Cucumber is giving an error:
java.util.NoSuchElementException
at java.base/java.util.Spliterators$2Adapter.nextInt(Spliterators.java:733)
at java.base/java.util.PrimitiveIterator$OfInt.next(PrimitiveIterator.java:128)
at java.base/java.util.PrimitiveIterator$OfInt.next(PrimitiveIterator.java:86)
at io.cucumber.cucumberexpressions.GroupBuilder.build(GroupBuilder.java:18)
at io.cucumber.cucumberexpressions.GroupBuilder.build(GroupBuilder.java:21)
at io.cucumber.cucumberexpressions.GroupBuilder.build(GroupBuilder.java:21)
at io.cucumber.cucumberexpressions.TreeRegexp.match(TreeRegexp.java:78)
at io.cucumber.cucumberexpressions.Argument.build(Argument.java:15)
at io.cucumber.cucumberexpressions.CucumberExpression.match(CucumberExpression.java:144)
at io.cucumber.core.stepexpression.StepExpression.match(StepExpression.java:22)
at io.cucumber.core.stepexpression.ArgumentMatcher.argumentsFrom(ArgumentMatcher.java:30)
It sounds as if I have missed a closing group but I've been staring at it for hours and can't see what I've done wrong yet.
I thought maybe throwing it to StackOverflow would immediately get someone to spot the issue. LOL
In case it matters, the definition of $realRegex is as follows:
private const val doubleTermRegex = "\\d+(?:\\.\\d+)?"
private const val realTermRegex = "(?:-?(?:√?(?:$doubleTermRegex|π)|∞))"
const val realRegex = "(?:$realTermRegex(?:\\s*\\/\\s*$realTermRegex)?)"
That code has been around for much longer and is exercised by many tests, so I'm guessing the issue is in the new code, but I guess you never know.
Versions in use:
Cucumber-Java8 5.7.0
Kotlin-JVM 1.5.21
Java 11
In any case, here's the whole file for context.
package garden.ephemeral.rocket.util
import garden.ephemeral.rocket.util.RealParser.Companion.realFromString
import garden.ephemeral.rocket.util.RealParser.Companion.realRegex
import io.cucumber.java8.En
class ComplexStepDefinitions: En {
lateinit var z1: Complex
lateinit var z2: Complex
init {
ParameterType("complex", """(?x)
| complex\( \s*
| (?:
| ($realRegex)
| (?:
| \s* ([+-]) \s* ($realRegex) \s* i
| )?
| |
| ($realRegex) \s* i
| )
| \s* \)
|""".trimMargin()) {
realPartString: String?,
joiner: String?,
imaginaryPartString1: String?,
imaginaryPartString2: String? ->
if (realPartString != null) {
if (imaginaryPartString1 != null) {
val imaginarySign = if (joiner == "-") -1.0 else 1.0
Complex(realFromString(realPartString), imaginarySign * realFromString(imaginaryPartString1))
} else {
Complex(realFromString(realPartString), 0.0)
}
} else if (imaginaryPartString2 != null) {
Complex(0.0, realFromString(imaginaryPartString2))
} else {
throw AssertionError("It shouldn't have matched the regex")
}
}
}
}

You've used (?x) as a flag expression to enable whitespace and comments in the pattern, which is probably a good idea given the size of the monster.
When using Cucumber Expressions, Cucumber parses your regex itself, and the parser in your version did not support flag expressions, which is the error you are seeing now. Updating Cucumber to the latest version should fix that problem. You are welcome.
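To make concrete what the flag does: in free-spacing mode the whitespace in the pattern is layout only, so the expression behaves like the same pattern with that whitespace removed. The sketch below does this by hand in JavaScript, whose RegExp has no (?x) flag. It is an illustration only, not something Cucumber does or requires. realRegex here is a simplified stand-in for the question's $realRegex, and collapsing whitespace this way is only safe because the pattern contains no literal spaces and no comments.

```javascript
// Simplified stand-in for $realRegex from the question (assumption: plain decimals only).
const realRegex = String.raw`-?\d+(?:\.\d+)?`;

// Same layout as the question's pattern, without the (?x) flag expression.
const spaced = String.raw`
  complex\( \s*
  (?:
    (${realRegex})
    (?:
      \s* ([+-]) \s* (${realRegex}) \s* i
    )?
  |
    (${realRegex}) \s* i
  )
  \s* \)
`;

// Remove the layout whitespace by hand; \s escapes are untouched because
// they are a backslash plus a letter, not whitespace characters.
const compact = spaced.replace(/\s+/g, "");
const complexRegex = new RegExp("^" + compact + "$");

console.log(compact);
// Four capture groups, in the same order as the lambda parameters in the question.
console.log(complexRegex.exec("complex(1.5 - 2i)").slice(1));
console.log(complexRegex.exec("complex(3i)").slice(1));
```

Groups that belong to the alternative that did not match come back as undefined, which corresponds to the nullable String? parameters in the Kotlin lambda.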

Related

positive lookbehind in kotlin doesn't work in match

I'm iterating on this file:
[INFO] com.demo:communication:jar:3.5.0-SNAPSHOT
[INFO] +- com.cellwize.optserver:optserver-admin:jar:3.5.0-SNAPSHOT:compile
[INFO] | +- org.apache.logging.log4j:log4j-api:jar:2.7:compile
[INFO] | +- org.apache.logging.log4j:log4j-core:jar:2.7:compile
[INFO] | | \- (org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate)
[INFO] | +- org.apache.logging.log4j:log4j-slf4j-impl:jar:2.7:compile
[INFO] | | +- org.slf4j:slf4j-api:jar:1.7.21:compile
[INFO] | | \- (org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate)
I want to remove the prefix from every line: "[INFO] " / "[INFO] +- " / "[INFO] | | \- " etc.
I'm using this function I wrote on every line in the file:
private fun extractDependency(raw: String): Dependency {
val uniqueDependencyRegex = Regex.fromLiteral("(?<=\\+- ).*")
val duplicateDependencyRegex = Regex.fromLiteral("(?<=\\().+?(?=\\))")
val projectRegex = Regex.fromLiteral("(?<=\\[INFO\\] ).*")
when {
uniqueDependencyRegex matches raw -> {
val matchResult = uniqueDependencyRegex.matchEntire(raw)
println(matchResult)
}
duplicateDependencyRegex matches raw -> {
val matchResult = duplicateDependencyRegex.matchEntire(raw)
println(matchResult)
}
projectRegex matches raw -> {
val matchResult = projectRegex.matchEntire(raw)
println(matchResult)
}
else -> {
//TODO - throw exception
}
}
return Dependency("test", "test", "test", "test")
}
I'm expecting it to work after I tested the regular expressions:
First Condition
Second Condition
Third Condition
The result I want is:
com.demo:communication:jar:3.5.0-SNAPSHOT
com.cellwize.optserver:optserver-admin:jar:3.5.0-SNAPSHOT:compile
org.apache.logging.log4j:log4j-api:jar:2.7:compile
org.apache.logging.log4j:log4j-core:jar:2.7:compile
org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate
org.apache.logging.log4j:log4j-slf4j-impl:jar:2.7:compile
org.slf4j:slf4j-api:jar:1.7.21:compile
org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate
You could either match [INFO] followed by a character class that will match any of the listed characters [| +\\(-], or match ) at the end of the string.
In the replacement use an empty string.
^\[INFO\][| +\\(-]+|\)$
With double escaped backslashes
^\\[INFO\\][| +\\\\(-]+|\\)$
regex demo
A bit more precise pattern could be repeatedly matching any of the occurring patterns like | or +- or \- and capture the content in group 1 between optional parenthesis. Then use the group in the replacement.
^\[INFO\](?:(?: +(?:\||\+-|\\-))+)? +\(?(.*?)\)?$
Regex demo
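As a quick check of the two patterns above, here is a minimal sketch in JavaScript, chosen only for illustration. The sample lines are copied from the question; the function names stripSimple and stripPrecise are hypothetical.

```javascript
// Sample lines copied from the question's input.
const lines = [
  "[INFO] com.demo:communication:jar:3.5.0-SNAPSHOT",
  "[INFO] +- com.cellwize.optserver:optserver-admin:jar:3.5.0-SNAPSHOT:compile",
  "[INFO] | +- org.apache.logging.log4j:log4j-core:jar:2.7:compile",
  "[INFO] | | \\- (org.apache.logging.log4j:log4j-api:jar:2.7:compile - omitted for duplicate)",
];

// First pattern: match the prefix or a trailing ")" and replace both with "".
function stripSimple(line) {
  return line.replace(/^\[INFO\][| +\\(-]+|\)$/g, "");
}

// Second pattern: capture the dependency in group 1 and keep only that group.
function stripPrecise(line) {
  return line.replace(/^\[INFO\](?:(?: +(?:\||\+-|\\-))+)? +\(?(.*?)\)?$/, "$1");
}

for (const line of lines) {
  console.log(stripSimple(line));
  console.log(stripPrecise(line));
}
```

The first pattern needs the global flag because a single line can contain two matches (the prefix and the closing parenthesis). The second pattern matches the whole line once, so a single replacement is enough.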

Grammar a bit too greedy in Perl6

I am having problems with this mini-grammar, which tries to match markdown-like header constructs.
role Like-a-word {
regex like-a-word { \S+ }
}
role Span does Like-a-word {
regex span { <like-a-word>[\s+ <like-a-word>]* }
}
grammar Grammar::Headers does Span {
token TOP {^ <header> \v+ $}
token hashes { '#'**1..6 }
regex header {^^ <hashes> \h+ <span> [\h* $0]? $$}
}
I would like it to match ## Easier ## as a header, but instead it takes ## as part of span:
TOP
| header
| | hashes
| | * MATCH "##"
| | span
| | | like-a-word
| | | * MATCH "Easier"
| | | like-a-word
| | | * MATCH "##"
| | | like-a-word
| | | * FAIL
| | * MATCH "Easier ##"
| * MATCH "## Easier ##"
* MATCH "## Easier ##\n"
「## Easier ##
」
header => 「## Easier ##」
hashes => 「##」
span => 「Easier ##」
like-a-word => 「Easier」
like-a-word => 「##」
The problem is that the [\h* $0]? simply does not seem to work, with span gobbling up all available words. Any idea?
First, as others have pointed out, <hashes> does not capture into $0, but instead, it captures into $<hashes>, so you have to write:
regex header {^^ <hashes> \h+ <span> [\h* $<hashes>]? $$}
But that still doesn't match the way you want, because the [\h* $<hashes>]? part happily matches zero occurrences.
The proper fix is to not let span match ## as a word:
role Like-a-word {
regex like-a-word { <!before '#'> \S+ }
}
role Span does Like-a-word {
regex span { <like-a-word>[\s+ <like-a-word>]* }
}
grammar Grammar::Headers does Span {
token TOP {^ <header> \v+ $}
token hashes { '#'**1..6 }
regex header {^^ <hashes> \h+ <span> [\h* $<hashes>]? $$}
}
say Grammar::Headers.subparse("## Easier ##\n", :rule<header>);
If you are loath to modify like-a-word, you can also force the exclusion of a final # from it like this:
role Like-a-word {
regex like-a-word { \S+ }
}
role Span does Like-a-word {
regex span { <like-a-word>[\s+ <like-a-word>]* }
}
grammar Grammar::Headers does Span {
token TOP {^ <header> \v+ $}
token hashes { '#'**1..6 }
regex header {^^ <hashes> \h+ <span> <!after '#'> [\h* $<hashes>]? $$}
}
say Grammar::Headers.subparse("## Easier ##\n", :rule<header>);
Just change
regex header {^^ <hashes> \h+ <span> [\h* $0]? $$}
to
regex header {^^ (<hashes>) \h+ <span> [\h* $0]? $$}
So that the capture works. Thanks to Eugene Barsky for calling this.
I played with this a bit because I thought there were two interesting things you might do.
First, you can make hashes take an argument about how many it will match. That way you can do special things based on the level if you like. You can reuse hashes in different parts of the grammar where you require different but exact numbers of hash marks.
Next, the ~ stitcher allows you to specify that something will show up in the middle of two things so you can put those wrapper things next to each other. For example, to match (Foo) you could write '(' ~ ')' Foo. With that it looks like I came up with the same thing you posted:
use Grammar::Tracer;
role Like-a-word {
regex like-a-word { \S+ }
}
role Span does Like-a-word {
regex span { <like-a-word>[\s+ <like-a-word>]* }
}
grammar Grammar::Headers does Span {
token TOP {^ <header> \v+ $}
token hashes ( $n = 1 ) { '#' ** {$n} }
regex header { [(<hashes(2)>) \h*] ~ [\h* $0] <span> }
}
my $result = Grammar::Headers.parse( "## Easier ##\n" );
say $result;

Perl6 grammar and action error : "Cannot find method 'ann' on object of type NQPMu"

Okay, I am still having trouble with perl6 grammar and action. I want to find a pattern in a string, and as soon as it is found, change the pattern according to action, and return the modified string.
my $test = "xx, 1-March-23, 23.feb.21, yy foo 12/january/2099 , zzz";
# want this result: xx, 010323, 230221, yy foo 120199 , zzz";
# 2 digits for day, month, year
grammar month {
regex TOP { <unit>+ }
regex unit { <before> <form1> <after> }
regex before { .*? }
regex after { .*? }
regex form1 { \s* <dd> <slash> <mon> <slash> <yy> \s* }
regex slash { \s* <[ \- \/ \. ]> \s* }
regex dd { \d ** 1..2 }
regex yy { (19 | 20)? \d\d }
proto regex mon {*}
regex mon:sym<jan> { \w 'an' \w* }
regex mon:sym<feb> { <sym> }
regex mon:sym<mar> { <[Mm]> 'ar' \w* }
}
class monAct {
method TOP ($/) { make $<unit>.map({.made}); }
method unit ($/) { make $<before> ~ $<form1>.made ~$<after>; }
method form1 ($/) { make $<dd>.made ~ $<mon>.made ~ $<yy>; }
method dd ($/) {
my $ddStr = $/.Str;
if $ddStr.chars == 1 { make "0" ~ $ddStr; } else { make $ddStr; }
}
method mon:sym<jan> ($/) { make "01"; };
method mon:sym<feb> ($/) { make "02"; };
method mon:sym<mar> ($/) { make "03"; };
}
my $m = month.parse($test, actions => monAct.new);
say $m;
say $m.made;
But it says:
===SORRY!===Cannot find method 'ann' on object of type NQPMu
What did I do wrong ? Thank you for your help !!!
This looks like a bug in Rakudo to me, possibly related to before being part of the syntax for lookahead assertions.
It can already be triggered with a simple / <before> /:
$ perl6 --version
This is Rakudo version 2016.11-20-gbd42363 built on MoarVM version 2016.11-10-g0132729
implementing Perl 6.c.
$ perl6 -e '/ <before> /'
===SORRY!===
Cannot find method 'ann' on object of type NQPMu
At the very least, it's a case of a less than awesome error message.
You should report this to rakudobug@perl.org, cf. How to report a bug.

Couchbase xdcr regex - How do I exclude keys using regex?

I am trying to exclude certain documents from being transported to ES using XDCR.
I have the following regex that filters ABCD and IJ
https://regex101.com/r/gI6sN8/11
Now, I want to use this regex in the XDCR filtering
^(?!.*(ABCD|IJ)).*$
How do I exclude keys using regex?
EDIT:
What if I want to select everything that doesn't contain ABCDE and ABCHIJ?
I tried
https://regex101.com/r/zT7dI4/1
edit:
Sorry, after looking at it further, this method is invalid. For instance, [^B] allows an A to get by, letting AABCD slip through (since it will match AA at first, then match BCD with the [^A]). Please disregard this post.
Demo here shows below method is invalid
(disregard this)
You could use a posix style trick to exclude words.
Below is to exclude ABCD and IJ.
You get a sense of the pattern from this.
Basically, you put all the first letters into a negative class
as the first in the alternation list, then handle each word
in a separate alternation.
^(?:[^AI]+|(?:A(?:[^B]|$)|AB(?:[^C]|$)|ABC(?:[^D]|$))|(?:I(?:[^J]|$)))+$
Demo
Expanded
^
(?:
[^AI]+
|
(?: # Handle 'ABCD`
A
(?: [^B] | $ )
| AB
(?: [^C] | $ )
| ABC
(?: [^D] | $ )
)
|
(?: # Handle 'IJ`
I
(?: [^J] | $ )
)
)+
$
Hopefully one day there will be built-in support for inverting the match expression. In the meantime, here's a Java 8 program that generates regular expressions for inverted prefix matching using basic regex features supported by the Couchbase XDCR filter.
This should work as long as your key prefixes are somehow delimited from the remainder of the key. Make sure to include the delimiter in the input when modifying this code.
Sample output for red:, reef:, green: is:
^([^rg]|r[^e]|g[^r]|re[^de]|gr[^e]|red[^:]|ree[^f]|gre[^e]|reef[^:]|gree[^n]|green[^:])
File: NegativeLookaheadCheater.java
import java.util.*;
import java.util.stream.Collectors;
public class NegativeLookaheadCheater {
public static void main(String[] args) {
List<String> input = Arrays.asList("red:", "reef:", "green:");
System.out.println("^" + invertMatch(input));
}
private static String invertMatch(Collection<String> literals) {
int maxLength = literals.stream().mapToInt(String::length).max().orElse(0);
List<String> terms = new ArrayList<>();
for (int i = 0; i < maxLength; i++) {
terms.addAll(terms(literals, i));
}
return "(" + String.join("|", terms) + ")";
}
private static List<String> terms(Collection<String> words, int index) {
List<String> result = new ArrayList<>();
Map<String, Set<Character>> prefixToNextLetter = new LinkedHashMap<>();
for (String word : words) {
if (word.length() > index) {
String prefix = word.substring(0, index);
prefixToNextLetter.computeIfAbsent(prefix, key -> new LinkedHashSet<>()).add(word.charAt(index));
}
}
prefixToNextLetter.forEach((literalPrefix, charsToNegate) -> {
result.add(literalPrefix + "[^" + join(charsToNegate) + "]");
});
return result;
}
private static String join(Collection<Character> collection) {
return collection.stream().map(c -> Character.toString(c)).collect(Collectors.joining());
}
}
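To see how the generated expression behaves, the sample output above can be tried against a few keys. This is a minimal sketch in JavaScript rather than in the XDCR filter itself. The keys are made up for illustration, and it assumes the filter applies the expression as a search from the start of the key, the way RegExp.prototype.test does with a ^ anchor.

```javascript
// The generated pattern from the sample output for red:, reef:, green:.
const keepFilter = /^([^rg]|r[^e]|g[^r]|re[^de]|gr[^e]|red[^:]|ree[^f]|gre[^e]|reef[^:]|gree[^n]|green[^:])/;

// Hypothetical document keys.
const keys = ["red:1", "reef:2", "green:3", "blue:4", "reed:5", "grey:6", "re"];

for (const key of keys) {
  console.log(key, keepFilter.test(key) ? "replicated" : "filtered out");
}
```

One thing this makes visible: a key that is shorter than a listed prefix and is itself the start of one, such as re, is also filtered out, because every term in the alternation needs one more character after the literal part.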

PCRE regex for multiple decimal coordinates using [lon,lat] format

I am trying to create a regex for [lon,lat] coordinates.
The code first checks if the input starts with '['.
If it does we check the validity of the coordinates via a regex
/([\[][-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)[\]][\;]?)+/gm
The regex tests for [lon,lat] with 15 decimals [+- 180degrees, +-90degrees]
it should match :
single coordinates :
[120,80];
[120,80]
multiple coordinates
[180,90];[180,67];
[180,90];[180,67]
with newlines
[123,34];[-32,21];
[12,-67]
it should not match:
semicolon separator missing - single
[25,67][76,23];
semicolon separator missing - multiple
[25,67]
[76,23][12,90];
I currently have problems with the ; between coordinates (see 4 & 5)
jsfiddle equivalent here : http://regex101.com/r/vQ4fE0/4
You can try with this (human readable) pattern:
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<lon> [+-]?
(?:
180 (?:\.0{1,15})?
|
(?: 1(?:[0-7][0-9]?|[89])? | [2-9][0-9]? | 0 )
(?:\.[0-9]{1,15})?
)
)
(?<lat> [+-]?
(?:
90 (?:\.0{1,15})?
|
(?: [1-8][0-9]? | 9 | 0 )
(?:\.[0-9]{1,15})?
)
)
)
\A
\[ \g<lon> , \g<lat> ] (?: ; \n? \[ \g<lon> , \g<lat> ] )* ;?
\z
~x
EOD;
explanations:
When you have to deal with a long pattern in which the same subpatterns are repeated several times, you can use several features to make it more readable.
The best known is free-spacing mode (the x modifier), which lets you indent the pattern as you like (all whitespace is ignored) and optionally add comments.
The second is to define subpatterns in a definition section (?(DEFINE)...), where you can declare named subpatterns to be used later in the main pattern.
Since I don't want to repeat the large subpatterns that describe the longitude and latitude numbers, I have created two named subpatterns, "lon" and "lat", in the definition section. To use them in the main pattern, I only need to write \g<lon> and \g<lat>.
javascript version:
// 18 and 19 need their own branch after the leading 1; latitude 0 is valid too
var lon_sp = '(?:[+-]?(?:180(?:\\.0{1,15})?|(?:1(?:[0-7][0-9]?|[89])?|[2-9][0-9]?|0)(?:\\.[0-9]{1,15})?))';
var lat_sp = '(?:[+-]?(?:90(?:\\.0{1,15})?|(?:[1-8][0-9]?|9|0)(?:\\.[0-9]{1,15})?))';
var coo_sp = '\\[' + lon_sp + ',' + lat_sp + '\\]';
var regex = new RegExp('^' + coo_sp + '(?:;\\n?' + coo_sp + ')*;?$');
var coordinates = new Array('[120,80];',
'[120,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]');
for (var i = 0; i<coordinates.length; i++) {
console.log("\ntest "+(i+1)+": " + regex.test(coordinates[i]));
}
fiddle
Try this out:
^(\[([+-]?(?!(180\.|18[1-9]|19\d{1}))\d{1,3}(\.\d{1,15})?,[+-]?(?!(90\.|9[1-9]))\d{1,2}(\.\d{1,15})?(\];$|\]$|\];\[)){1,})
Demo: http://regex101.com/r/vQ4fE0/7
Explanation
^(\[
Must start with a bracket
[+-]?
May or may not contain +- in front of the number
(?!(180\.|18[1-9]|19\d{1}))
Should not contain 180., 181-189 nor 19x
\d{1,3}(\.\d{1,15})?
Otherwise, any number containing 1 to 3 digits, with or without decimals (up to 15), is allowed
(?!(90\.|9[1-9]))
The 90 check is similar, but here we are not allowing 90. nor 91-99
\d{1,2}(\.\d{1,15})?
Otherwise, any number containing 1 or 2 digits, with or without decimals (up to 15), is allowed
(\];$|\]$|\];\[)
The ending of a bracket body must have a ; separating two bracket bodies, otherwise it must be the end of the line.
{1,}
The brackets can exist 1 or multiple times
Hope this was helpful.
This might work. Note that you have a lot of capture groups, none of which
will give you good information, because they sit inside a repeated group and only the values from its last iteration are kept.
# /^(\[[-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)\](?:;\n?|$))+$/
^
( # (1 start)
\[
[-+]?
( # (2 start)
180
( \. 0{1,15} )? # (3)
|
( # (4 start)
( 1 [0-7] \d ) # (5)
|
( [1-9]? \d ) # (6)
) # (4 end)
( \. \d{1,15} )? # (7)
) # (2 end)
,
[-+]?
( # (8 start)
[1-8]? \d
( \. \d{1,15} )? # (9)
|
90
( \. 0{1,15} )? # (10)
) # (8 end)
\]
(?: ; \n? | $ )
)+ # (1 end)
$
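The point about capture groups can be seen with a much smaller pattern. The sketch below uses a simplified stand-in for the coordinate regex (integers only, an assumption made to keep it short) and shows that groups inside a repeated group only report the last iteration. It then shows one possible way to get every pair: a separate global pattern with matchAll.

```javascript
// Simplified stand-in for the coordinate pattern (assumption: integers only).
const repeated = /^(\[(-?\d+),(-?\d+)\](?:;\n?|$))+$/;
const input = "[1,2];[3,4];[5,6]";

// Groups 2 and 3 are overwritten on every iteration of group 1.
const whole = repeated.exec(input);
console.log(whole[2], whole[3]); // 5 6

// To read every pair, validate with the anchored pattern first,
// then walk the input with an unanchored global pattern.
const single = /\[(-?\d+),(-?\d+)\]/g;
const pairs = Array.from(input.matchAll(single), (m) => [Number(m[1]), Number(m[2])]);
console.log(pairs);
```

Validation and extraction are kept as two steps here so that the anchored pattern still enforces the semicolon separators, while the second pattern only reads values out of input that has already been accepted.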
Try a function approach, where the function can do some of the splitting for you, as well as delegating the number comparisons away from the regex. I tested it here: http://repl.it/YyG/3
//represents regex necessary to capture one coordinate, which
// looks like 123 or 123.13532
// the decimal part is a non-capture group ?:
var oneCoord = '(-?\\d+(?:\\.\\d+)?)';
//console.log("oneCoord is: "+oneCoord+"\n");
//one coordinate pair is represented by [x,x]
// check start/end with ^, $
var coordPair = '^\\['+oneCoord+','+oneCoord+'\\]$';
//console.log("coordPair is: "+coordPair+"\n");
//the full regex string consists of one or more coordinate pairs,
// but we'll do the splitting in the function
var myRegex = new RegExp(coordPair);
//console.log("my regex is: "+myRegex+"\n");
function isPlusMinus180(x)
{
return -180.0<=x && x<=180.0;
}
function isPlusMinus90(y)
{
return -90.0<=y && y<=90.0;
}
function isValid(s)
{
//if there's a trailing semicolon, remove it
if(s.slice(-1)==';')
{
s = s.slice(0,-1);
}
//remove all newlines and split by semicolon
var all = s.replace(/\n/g,'').split(';');
//console.log(all);
for(var k=0; k<all.length; ++k)
{
var match = myRegex.exec(all[k]);
if(match===null)
return false;
console.log(" match[1]: "+match[1]);
console.log(" match[2]: "+match[2]);
//break out if one pair is bad
if(! (isPlusMinus180(match[1]) && isPlusMinus90(match[2])) )
{
console.log(" one of matches out of bounds");
return false;
}
}
return true;
}
var coords = new Array('[120,80];',
'[120.33,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]',
'[190,33.33]',
'[180.33,33]',
'[179.87,90]',
'[179.87,91]');
var s;
for (var i = 0; i<coords.length; i++) {
s = coords[i];
console.log((i+1)+". ==== testing "+s+" ====");
console.log(" isValid? => " + isValid(s));
}