issue with regexp with nested groups in golang - regex

Consider the following toy example. I want to match in Go a name with a regexp where the name is sequences of letters a separated by single #, so a#a#aaa is valid, but a# or a##a are not. I can code the regexp in the following two ways:
r1 := regexp.MustCompile(`^a+(#a+)*$`)
r2 := regexp.MustCompile(`^(a+#)*a+$`)
Both of these work. Now consider more complex task of matching a sequence of names separated by single slash. As in above, I can code that in two ways:
^N+(/N+)*$
^(N+/)*N+$
where N is a regexp for the name with ^ and $ stripped. As I have two cases for N, so now I can have 4 regexps:
^a+(#a+)*(/a+(#a+)*)*$
^(a+#)*a+(/a+(#a+)*)*$
^((a+#)*a+/)*a+(#a+)*$
^((a+#)*a+/)*(a+#)*a+$
The question is why when matching against the string "aa#a#a/a#a/a" the first one fails while the rest 3 cases work as expected? I.e. what causes the first regexp to mismatch? The full code is:
package main
import (
"fmt"
"regexp"
)
func main() {
str := "aa#a#a/a#a/a"
regs := []string {
`^a+(#a+)*(/a+(#a+)*)*$`,
`^(a+#)*a+(/a+(#a+)*)*$`,
`^((a+#)*a+/)*a+(#a+)*$`,
`^((a+#)*a+/)*(a+#)*a+$`,
}
for _, r := range(regs) {
fmt.Println(regexp.MustCompile(r).MatchString(str))
}
}
Surprisingly it prints false true true true

This must be a bug in golang regexp engine. A simpler test case is that ^a(/a+(#a+)*)*$ fails to match a/a#a while ^(a)(/a+(#a+)*)*$ works, see http://play.golang.org/p/CDKxVeXW98 .
I filed https://github.com/golang/go/issues/11905

You can try to alleviate atomic sub-nesting of quantifiers.
If you didn't have the anchors, the expression would possibly blow up
in backtracking if using nested optional quantifiers like this,
when it can't find a direct solution.
Go might be balking at that, forcing it to fail outright instead
of massive backtracking, but not sure.
Try something like this, see if it works.
# ^a+(#?a)*(/a+(#?a)*)*$
^
a+
( # (1 start)
\#?
a
)* # (1 end)
( # (2 start)
/
a+
( # (3 start)
\#?
a
)* # (3 end)
)* # (2 end)
$
edit: (Transposed from comment) If complexity is too high, some engines will not even compile it, some will silently fail at runtime, some will lock up in a backtracking loop.
This little subtle difference is the problem
bad: too complex (#?a+)*
good: no nested, open ended quantifiers (#?a)*
If you ever have this problem, strip out the nesting, usually solves
the backtracking issue.
eit2: If you require a separator and want to insure a separator is only in the middle and surrounded by a's, you could try this
https://play.golang.org/p/oM6B6H3Kdx
# ^a+[#/](a[#/]?)*a+$
^
a+
[\#/]
( # (1 start)
a
[\#/]?
)* # (1 end)
a+
$
or this
https://play.golang.org/p/WihqSjH_dI
# ^a+(?:[#/]a+)+$
^
a+
(?:
[\#/]
a+
)+
$

Related

how to get numbers from array of strings?

I have this array of strings.
["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "|", " ", "Total", "chain", "value:", "4,948"]
and I'm trying to get numbers in one line of code. I tried many methods but wasn't really helpful as am expecting.
I'm using one with grep method:
array.grep(/\d+/, &:to_i) #[9, 4]
but it returns an array of first integers only. It seems like I have to add something to the pattern but I don't know what.
Or there is another way to grab these numbers in an Array?
you can use:
array.grep(/[\d,]+\.?\d+/)
if you want int:
array.grep(/[\d,]+\.?\d+/).map {_1.gsub(/[^0-9\.]/, '').to_i}
and a faster way (about 5X to 10X):
array.grep(/[\d,]+\.?\d+/).map { _1.delete("^0-9.").to_i }
for a data like:
%w[
,,,4
1
1.2.3.4
-2
1,2,3
9,999.00
4,948
22,956
22,536,129,336
123,456
12.]
use:
data.grep(/^-?\d{1,3}(,\d{3})*(\.?\d+)?$/)
output:
["1", "-2", "9,999.00", "4,948", "22,956", "22,536,129,336", "123,456"]
arr = ["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "61.4.5",
"|", "chain", "-4,948", "3,25.61", "1,234,567.899"]
rgx = /\A\-?\d{1,3}(?:,\d{3})*(?:\.\d+)?\z/
arr.grep(rgx)
#=> ["9,999.00", "-4,948", "1,234,567.899"]
Regex demo. At the link the regular expression was evaluated with the PCRE regex engine but the results are the same when Ruby's Onigmo engine is used. Also, at the link I've used the anchors ^ and $ (beginning and end of line) instead of \A and \z (beginning and end of string) in order test the regex against multiple strings.
The regular expression can be broken down as follows.
/
\A # match the beginning of the string
\-? # optionally match '-'
\d{1,3} # match between 1 and 3 digits inclusively
(?: # begin a non-capture group
,\d{3} # match a comma followed by 3 digits
)* # end the non-capture group and execute 0 or more times
(?: # begin a non-capture group
\.\d+ # match a period followed by one or more digits
)? # end the non-capture and make it optional
\z # match the end of the string
/
To make the test more robust we could use the methods Kernel::Float, Kernel::Rational and Kernel::Complex, all with the optional argument :exception set to false.
arr = ["Total", "9,999.00", " ", "61.4.5", "23e4", "-234.7e-2", "1+2i",
"3/4", "|", "chain", "-4,948", "3,25.61", "1,234,567.899", "10"]
arr.select { |s| s.match?(rxg) || Float(s, exception: false) ||
Rational(s, exception: false) Complex(s, exception: false) }
#=> ["9,999.00", "23e4", "-234.7e-2", "1+2i", "3/4", "-4,948",
# "1,234,567.899", "10"]
Note that "23e4", "-234.7e-2", "1+2i" and "3/4" are respectively the string representations of an integer, float, complex and rational number.

Regex - math with cycle

How to count the amount of a match inside itself to skip some characters?
Example:
I have:
(a(b(c)))
If I run this regex: \(.+?\)
It will be return: (a(b(c)
But what I want is the ) that closes the loop, that is, the third.
I could just remove the ? From the regex but there is a problem:
Ex: \(.+\) to (a)(a(b(c))) return (a)(a(b(c)))
And what I want is for the group to return to me with the closed loop of (), that is, it should return 2 matchs to me:
match 1: (a)
match 2: (a(b(c)))
What is the question of counting in the match? Well, what I thought was if there is any way to count how many ( passed to know how many ) one should skip, that is:
1 2 3 1 2 3
( a ( b ( c ) ) )
Does anyone have any idea how to do this just using regex?
If you need to use regex, please try the recursive regex (?R).
The implementation depends on the language so let me explain it with python.
#!/usr/bin/python
import regex
str ='(a)(a(b(c)))'
m = regex.findall(r'\((?:[^()]|(?R))+\)', str)
print(m)
Output:
['(a)', '(a(b(c)))']
Explanation of the regex pattern \((?:[^()]|(?R))+\):
The inner part (?:[^()]|(?R))+ matches:
one or more [^()] or (?R) where
[^()] matches any character other than parentheses.
(?R) represents the entire regex \((?:[^()]|(?R))+\) recursively.

Match very specific open and close brackets

I am trying to modify some LaTeX Beamer code, and want to do a quick regex find a certain pattern that defines a block of code. For example, in
\only{
\begin{equation*}
\psi_{w}(B)(1-B)^{d}y_{t,w} = x^{T}_{t,w}\gamma_{w} + \eta_{w}(B)z_{t,w},
\label{eq:gen_arima}
\end{equation*}
}
I want to match just the \only{ and the final }, and nothing else, to remove them. Is that even possible?
Regexes cannot count, as expressed in this famous SO answer: RegEx match open tags except XHTML self-contained tags
In addition, LaTeX itself has a fairly complicated grammar (i.e. macro expansions?). Ideally, you'd use a parser of some kind, but perhaps that's a little overkill. If you're looking for a really simple, quick and dirty hack which will only work for a certain class of inputs you can:
Search for \only.
Increment a counter every time you see a { and decrement it every time you see a }. If there is a \ preceding the {, ignore it. (Fancy points if you look for an odd number of \.)
When counter gets to 0, you've found the closing }.
Again, this is not reliable.
I want to remove \only{ and }, and keep everything within it
On a PCRE (php), Perl, Notepad++ it's done like this:
For something as simple as this, all you need is
Find \\only\s*({((?:[^{}]++|(?1))*)})
Replace $2
https://regex101.com/r/U3QxGa/1
Explained
\\ only \s*
( # (1 start)
{
( # (2 start), Inner core to keep
(?: # Cluster group
[^{}]++ # Possesive, not {}
| # or,
(?1) # Recurse to group 1
)* # End cluster, do 0 to many times
) # (2 end)
}
) # (1 end)

How do I access the captures within a match?

I am trying to parse a csv file, and I am trying to access names regex in proto regex in Perl6. It turns out to be Nil. What is the proper way to do it?
grammar rsCSV {
regex TOP { ( \s* <oneCSV> \s* \, \s* )* }
proto regex oneCSV {*}
regex oneCSV:sym<noQuote> { <-[\"]>*? }
regex oneCSV:sym<quoted> { \" .*? \" } # use non-greedy match
}
my $input = prompt("Enter csv line: ");
my $m1 = rsCSV.parse($input);
say "===========================";
say $m1;
say "===========================";
say "1 " ~ $m1<oneCSV><quoted>; # this fails; it is "Nil"
say "2 " ~ $m1[0];
say "3 " ~ $m1[0][2];
Detailed discussion complementing Christoph's answer
I am trying to parse a csv file
Perhaps you are focused on learning Raku parsing and are writing some throwaway code. But if you want industrial strength CSV parsing out of the box, please be aware of the Text::CSV modules[1].
I am trying to access a named regex
If you are learning Raku parsing, please take advantage of the awesome related (free) developer tools[2].
in proto regex in Raku
Your issue is unrelated to it being a proto regex.
Instead the issue is that, while the match object corresponding to your named capture is stored in the overall match object you stored in $m1, it is not stored precisely where you are looking for it.
Where do match objects corresponding to captures appear?
To see what's going on, I'll start by simulating what you were trying to do. I'll use a regex that declares just one capture, a "named" (aka "Associative") capture that matches the string ab.
given 'ab'
{
my $m1 = m/ $<named-capture> = ( ab ) /;
say $m1<named-capture>;
# 「ab」
}
The match object corresponding to the named capture is stored where you'd presumably expect it to appear within $m1, at $m1<named-capture>.
But you were getting Nil with $m1<oneCSV>. What gives?
Why your $m1<oneCSV> did not work
There are two types of capture: named (aka "Associative") and numbered (aka "Positional"). The parens you wrote in your regex that surrounded <oneCSV> introduced a numbered capture:
given 'ab'
{
my $m1 = m/ ( $<named-capture> = ( ab ) ) /; # extra parens added
say $m1[0]<named-capture>;
# 「ab」
}
The parens in / ( ... ) / declare a single top level numbered capture. If it matches, then the corresponding match object is stored in $m1[0]. (If your regex looked like / ... ( ... ) ... ( ... ) ... ( ... ) ... / then another match object corresponding to what matches the second pair of parentheses would be stored in $m1[1], another in $m1[2] for the third, and so on.)
The match result for $<named-capture> = ( ab ) is then stored inside $m1[0]. That's why say $m1[0]<named-capture> works.
So far so good. But this is only half the story...
Why $m1[0]<oneCSV> in your code would not work either
While $m1[0]<named-capture> in the immediately above code is working, you would still not get a match object in $m1[0]<oneCSV> in your original code. This is because you also asked for multiple matches of the zeroth capture because you used a * quantifier:
given 'ab'
{
my $m1 = m/ ( $<named-capture> = ( ab ) )* /; # * is a quantifier
say $m1[0][0]<named-capture>;
# 「ab」
}
Because the * quantifier asks for multiple matches, Raku writes a list of match objects into $m1[0]. (In this case there's only one such match so you end up with a list of length 1, i.e. just $m1[0][0] (and not $m1[0][1], $m1[0][2], etc.).)
Summary
Captures nest;
A capture quantified by either * or + corresponds to two levels of nesting not just one.
In your original code, you'd have to write say $m1[0][0]<oneCSV>; to get to the match object you're looking for.
[1] Install relevant modules and write use Text::CSV; (for a pure Raku implementation) or use Text::CSV:from<Perl5>; (for a Perl plus XS implementation) at the start of your code. (talk slides (click on top word, eg. "csv", to advance through slides), video, Raku module, Perl XS module.)
[2] Install CommaIDE and have fun with its awesome grammar/regex development/debugging/analysis features. Or install the Grammar::Tracer; and/or Grammar::Debugger modules and write use Grammar::Tracer; or use Grammar::Debugger; at the start of your code (talk slides, video, modules.)
The match for <oneCSV> lives within the scope of the capture group, which you get via $m1[0].
As the group is quantified with *, the results will again be a list, ie you need another indexing operation to get at a match object, eg $m1[0][0] for the first one.
The named capture can then be accessed by name, eg $m1[0][0]<oneCSV>. This will already contain the match result of the appropriate branch of the protoregex.
If you want the whole list of matches instead of a specific one, you can use >> or map, eg $m1[0]>>.<oneCSV>.

Regex find characters in string

Given the following string
2010-01-01XD2010-01-02XX2010-01-03NX2010-01-04XD2010-01-05DN
I am trying to find all instances of the date followed by one or two characters ie 2010-01-01XD but not where the characters are XX
I have tried
(2010-01-02[^X]{2})|(2010-01-08[^X]{2})|(2010-01-07[^X]{2})|(2010-01-05[^X]{2})|(2010-01-15[^X]{2})
this works if both chars are not X. I have also tried
(2010-01-02[^X]{1,2})|(2010-01-08[^X]{1,2})|(2010-01-07[^X]{1,2})|(2010-01-05[^X]{1,2})|(2010-01-15[^X]{1,2})
this works for for DX but not XD
So trying to be a little clearer
2010-01-01XD
2010-01-01DX
2010-01-01ND
All above should be picked up
2010-01-01XX
And this ignored
You can use this regex based on negative lookahead:
(20[0-9]{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])(?!XX)[A-Z]{2})
RegEx Demo
Easiest way is to use a lookahead assertion (if available).
# (2010-01-01|2010-01-02|2010-01-08|2010-01-07|2010-01-05|2010-01-15)(?!XX)(?i:([a-z]){1,2})
( # (1 start), One of these dates
2010-01-01
| 2010-01-02
| 2010-01-08
| 2010-01-07
| 2010-01-05
| 2010-01-15
) # (1 end)
(?! XX ) # Look ahead assertion, cannot match XX here
(?i: # 1 or 2 of any U/L case letter
( [a-z] ){1,2} # (2)
)
You could likely use a simple pattern with a negtive lookahead such as this:
\d{4}-\d{2}-\d{2}(?!XX)[A-Z]{1,2}
example: http://regex101.com/r/dI1nW4/2
To allow Unicode characters (with the exception of XX) you could use:
\d{4}-\d{2}-\d{2}(?!XX)\D{1,2}
example: http://regex101.com/r/yB5fI0/1
20[0-9]{2}-[01][0-9]-[0-3][0-9]([A-Z][A-WYZ]|[A-WYZ][A-Z])
See it in action.
A negative look ahead is the easiest way to assert the letters not being XX, but there are some simplifications you can make to the alternation by recognising the parts of the date shared by all dates you're trying to match, making this shorter regex:
2010-01-(02|08|07|05|15)(?!XX)[A-Z]{1,2}