How to match all overlapping pattern - regex

I want to get the indexes of the following pattern (\.\.#\.\.) in the following string :
...#...#....#.....#..#..#..#.......
But Go's regexp package does not return overlapping matches.
Thus I got : [[1 6 1 6] [10 15 10 15] [16 21 16 21] [22 27 22 27]]
As one can see, two dots precede and follow the second #, but that match is not returned by the method FindAllStringSubmatchIndex.
I tried to use different methods from regexp without success. Searching the documentation, I found nothing useful on https://golang.org/pkg/regexp and https://golang.org/src/regexp/regexp.go
On the contrary, it seems regexp does not natively support this functionality :
// If 'All' is present, the routine matches successive non-overlapping matches of the entire expression.
I can work around the issue, but since I am doing this exercise to learn Go, I want to know whether regexp can do it. Thanks :)
Here is my code for reference :
matches := r.pattern.FindAllStringSubmatchIndex(startingState)
fmt.Println(r.pattern)
fmt.Println(matches)
for _, m := range matches {
tempState = tempState[:m[0]+2] + "#" + tempState[m[0]+3:]
fmt.Println(tempState)
}

There's no reason to use a regex for this. A regex is overkill for such a simple task: it's overly complex and less efficient. Instead, you should just use strings.Index and a for loop:
input := "...#...#....#.....#..#..#..#......."
idx := []int{}
j := 0
for {
i := strings.Index(input[j:], "..#..")
if i == -1 {
break
}
fmt.Println(j)
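idx = append(idx, j+i) // i is relative to input[j:], so j+i is the absolute index
j += i + 1 // resume one byte past the match start so overlapping matches are found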
}
fmt.Println("Indexes:", idx)

Go is for programmers. For example,
package main
import (
"fmt"
"strings"
)
func findIndices(haystack, needle string) []int {
var x []int
// Use <= so that a match ending exactly at the end of haystack is still found.
for i := 0; i <= len(haystack)-len(needle); i++ {
j := strings.Index(haystack[i:], needle)
if j < 0 {
break
}
i += j
x = append(x, i)
}
return x
}
func main() {
haystack := `...#...#....#.....#..#..#..#.......`
needle := `..#..`
fmt.Println(findIndices(haystack, needle))
}
Playground: https://play.golang.org/p/nNE5IB1feQT
Output:
[1 5 10 16 19 22 25]
Regular Expression References:
Regular Expression Matching Can Be Simple And Fast
Implementing Regular Expressions
Package [regexp/]syntax

Related

Behavior of optional word matches in the regex '^(al |ali |alia |alias ) ' [duplicate]

I have a weird issue with the JavaScript RegExp.exec function. When I call it multiple times on what I assume are new regexp objects, it only works every other time. I don't get why at all!
Here is a small loop example, but the same thing happens when exec is used once inside a function that is called multiple times.
for (var i = 0; i < 5; ++i) {
console.log(i, (/(b)/g).exec('abc'));
}
> 0 ["b", "b"]
> 1 null
> 2 ["b", "b"]
> 3 null
> 4 ["b", "b"]
When removing the /g, it gets back to normal.
for (var i = 0; i < 5; ++i) {
console.log(i, (/(b)/).exec('abc'));
} /* no g ^ */
> 0 ["b", "b"]
> 1 ["b", "b"]
> 2 ["b", "b"]
> 3 ["b", "b"]
> 4 ["b", "b"]
I guess that there is an optimization, saving the regexp object, but it seems strange.
This behaviour is the same on Chrome 4 and Firefox 3.6, however it works as (I) expected in IE8. I believe that is intended but I can't find the logic in there, maybe you will be able to help me!
Thanks
If you're going to reuse the same regular expression anyway, take it out of the loop and explicitly reset it:
var pattern = /(b)/g;
for (var i = 0; i < 5; ++i) {
pattern.lastIndex = 0;
console.log(i + ' ' + pattern.exec("abc"));
}
/g is not intended to work for simple matching:
/g enables "global" matching. When using the replace() method, specify this modifier to replace all matches, rather than only the first one.
I'd imagine that internally JavaScript remembers the position after the last match (the regex's lastIndex property) so that it can resume matching from there; null is therefore returned on the next call, since b occurs only once in the subject. Compare:
for (var i = 0; i < 5; ++i) {
console.log(i +' ' + (/(b+)/g).exec("abbcb"));
}
returns:
0 bb,bb
1 b,b
2 null
3 bb,bb
4 b,b
Thanks :)
I found an interesting side effect: it's possible to make a static variable (in the C sense: global, but only visible from the function) without a closure!
function test () {
var static = /a/g;
if ('count' in static) {
static.count++;
} else {
static.count = 1;
}
console.log(static.count);
}
for (var i = 0; i < 5; ++i) { test(); }
1
2
3
4
5
(I'm making a new answer because we can't put code inside a comment)

Find all strings in between two strings in Go

I am working on extracting multiple matches between two strings.
In the example below, I am trying to regex out an A B C substring out of my string.
Here is my code:
package main
import (
"fmt"
"regexp"
)
func main() {
str:= "Movies: A B C Food: 1 2 3"
re := regexp.MustCompile(`[Movies:][^Food:]*`)
match := re.FindAllString(str, -1)
fmt.Println(match)
}
I am clearly doing something wrong in my regex. I am trying to get the A B C string between Movies: and Food:.
What is the proper regex to get all strings between two strings?
In Go, since its RE2-based regexp does not support lookarounds, you need to use a capturing group with the regexp.FindAllStringSubmatch function:
left := "LEFT_DELIMITER_TEXT_HERE"
right := "RIGHT_DELIMITER_TEXT_HERE"
rx := regexp.MustCompile(`(?s)` + regexp.QuoteMeta(left) + `(.*?)` + regexp.QuoteMeta(right))
matches := rx.FindAllStringSubmatch(str, -1)
Note the use of regexp.QuoteMeta that automatically escapes all special regex metacharacters in the left- and right-hand delimiters.
The (?s) makes . match across lines, and (.*?) captures everything between the left and right delimiters into Group 1.
So, here you can use
package main
import (
"fmt"
"regexp"
)
func main() {
str:= "Movies: A B C Food: 1 2 3"
r := regexp.MustCompile(`Movies:\s*(.*?)\s*Food`)
matches := r.FindAllStringSubmatch(str, -1)
for _, v := range matches {
fmt.Println(v[1])
}
}
See the Go demo. Output: A B C.

How to validate a comma-separated string in Golang?

I'm trying to check that a string is comma-separated "correctly", meaning that
validString := "a, b, c, d, e"
is a valid comma-separated string, as it contains single commas.
Here's an invalid string, as it contains multiple commas and a semicolon:
invalidString1 := "a,, b, c,,,,d, e,,; f, g"
Here's another invalid string, as it contains , , and a place whereby there is no comma between a and b:
invalidString2 := "a b, , c, d"
My first idea was to use the "regexp" package to check that the string is valid, using regex patterns discussed elsewhere: Regex for Comma delimited list
package main
import (
"fmt"
"regexp"
)
func main() {
r := regexp.MustCompile(`(\d+)(,\s*\d+)*`)
fmt.Println(r) // the pattern compiles, but nothing validates a string yet
}
However, I don't understand how we would use this to "validate" strings...that is, either classify the string as valid or invalid based on these regex patterns.
Once you have the regular expression compiled you can use Match (or MatchString) to check if there is a match e.g. (using a slightly modified regular expression so your examples work):
package main
import (
"fmt"
"regexp"
)
func main() {
r := regexp.MustCompile(`^(\w+)(,\s*\w+)*$`)
fmt.Println(r.MatchString("a, b, c, d, e"))
fmt.Println(r.MatchString("a,, b, c,,,,d, e,,; f, g"))
fmt.Println(r.MatchString("a b, , c, d"))
}
Try it in the go playground.
There are plenty of other ways of checking for a valid comma-separated list; which is best depends upon your use case. If you are loading the data into a slice (or other data structure), then it's often simplest to just check the values as you process them into the structure.
If performance is a key factor, you can simply remove all whitespace and proceed with the following algorithm:
func validate(str string) bool {
for i, c := range str {
if i%2 != 0 {
if c != ',' {
return false
}
}
}
return true
}
Here is the benchmark:
BenchmarkValidate-8 362536340 3.27 ns/op 0 B/op 0 allocs/op
BenchmarkValidateRegex-8 13636486 87.4 ns/op 0 B/op 0 allocs/op
Note:
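The procedure works only if every item is a single character and the string contains no spaces, because it relies on validating a strict "CHARACTER-SYMBOL-CHARACTER-SYMBOL" sequence. As written, it only checks that the odd positions hold commas; it does not reject a comma at an even position or a trailing comma.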
Code for the benchmark
func BenchmarkValidate(b *testing.B) {
invalidString1 := "a,, b, c,,,,d, e,,; f, g"
invalidString1 = strings.ReplaceAll(invalidString1, " ", "")
for x := 0; x < b.N; x++ {
validate(invalidString1)
}
}
func BenchmarkValidateRegex(b *testing.B) {
r := regexp.MustCompile(`^(\w+)(,\s*\w+)*$`)
invalidString1 := "a,, b, c,,,,d, e,,; f, g"
for x := 0; x < b.N; x++ {
r.MatchString(invalidString1)
}
}

Why this regular expression in Golang is non greedy? [duplicate]

Here is a simple regular expression:
package main
import (
"fmt"
"regexp"
)
const data = "abcdefghijklmn"
func main() {
r, err := regexp.Compile(".{1,6}")
if err != nil {
panic(err)
}
for _, d := range r.FindAllIndex([]byte(data), -1) {
fmt.Println(data[d[0]:d[1]])
}
}
And we know it is greedy:
abcdef
ghijkl
mn
Now, we can add a ? after the expression to make it non greedy:
package main
import (
"fmt"
"regexp"
)
const data = "abcdefghijklmn"
func main() {
r, err := regexp.Compile(".{1,6}?")
if err != nil {
panic(err)
}
for _, d := range r.FindAllIndex([]byte(data), -1) {
fmt.Println(data[d[0]:d[1]])
}
}
And we can get:
a
b
c
d
e
f
g
h
i
j
k
l
m
n
However, if we add other chars after the expression, it becomes greedy:
package main
import (
"fmt"
"regexp"
)
const data = "abcdefghijklmn"
func main() {
r, err := regexp.Compile(".{1,6}?k")
if err != nil {
panic(err)
}
for _, d := range r.FindAllIndex([]byte(data), -1) {
fmt.Println(data[d[0]:d[1]])
}
}
And we get:
efghijk
So why does it become greedy if we add a char after it?
Adding a lazy quantifier after a repetition count changes it from matching as many as possible, to as few as possible.
However, this does not change the fact that the string must be processed serially. This is where your two cases differ:
.{1,6}? returns one character at a time because this is the fewest matches as the string is being processed. The lazy quantifier lets the engine match after a single character, not needing to keep processing the string.
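.{1,6}?k cannot match starting at a, b, c, or d, because no k lies within reach of those positions, so the engine skips over abcd. Starting at e it does find a match, efghijk, and reports it. A lazy quantifier only shortens the match from a given starting position; it does not make the engine try a later starting position in search of a shorter match.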
In short: matching from the current position takes precedence over moving to the next position in the hope of a smaller match.
As for your question about making it lazy again, you can't. You'll have to find a different regular expression for the output you want.
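The difference between "where the match starts" and "how far it extends" can be seen in a short sketch. The helper name first and the second input string akbkc are illustrative assumptions, not part of the question.

```go
package main

import (
	"fmt"
	"regexp"
)

// first (hypothetical helper) returns the leftmost match of pattern in s.
func first(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

func main() {
	const data = "abcdefghijklmn"

	// Only one k exists, so lazy and greedy agree: the earliest start
	// from which a k is reachable is e.
	fmt.Println(first(`.{1,6}?k`, data)) // efghijk
	fmt.Println(first(`.{1,6}k`, data))  // efghijk

	// With two k's reachable from the same start, laziness shows up:
	// the start is identical, only the end of the match differs.
	fmt.Println(first(`.{1,6}?k`, "akbkc")) // ak
	fmt.Println(first(`.{1,6}k`, "akbkc"))  // akbk

	// A different expression is needed for a shorter result.
	fmt.Println(first(`.k`, data)) // jk
}
```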

Positive lookahead + overlapping matches regex

I'm looking for a regex to match all % that are not followed by a valid 2-characters hex code (2 characters in a-fA-F0-9). I came up with (%)(?=([0-9a-fA-F][^0-9a-fA-F]|[^0-9a-fA-F])) which works well but is not supported in golang, because of the positive lookahead (?=).
How can I translate it (or maybe make it simpler?), so that it works with go?
For example, given the string %d%2524e%25f%255E00%%%252611%25, it should match the first % and the first two ones of the %%% substring.
ie: https://regex101.com/r/y0YQ1I/2
I only tried this on regex101 (marked golang regex), but it seems that it works as expected:
%[0-9a-fA-F][0-9a-fA-F]|(%)
or simpler:
%[0-9a-fA-F]{2}|(%)
The real challenge here is that the matches at position 19 and 20 are overlapping, which means we can't use any of the go builtin "FindAll..." functions since they only find non-overlapping matches. This means that we've got to match the regex repeatedly against substrings starting after subsequent match indices if we want to find them all.
For the regex itself I've used a non-capturing group (?:...) instead of a lookahead assertion. Additionally, the regex will also match percent-signs at the end of the string, since they cannot be followed by two hex digits:
func findPlainPercentIndices(s string) []int {
re := regexp.MustCompile(`%(?:[[:xdigit:]][[:^xdigit:]]|[[:^xdigit:]]|$)`)
indices := []int{}
idx := 0
for {
m := re.FindStringIndex(s[idx:])
if m == nil {
break
}
nextidx := idx + m[0]
indices = append(indices, nextidx)
idx = nextidx + 1
}
return indices
}
func main() {
str := "%d%2524e%25f%255E00%%%252611%25%%"
//      012345678901234567890123456789012
//      0         1         2         3
fmt.Printf("OK: %#v\n", findPlainPercentIndices(str))
// OK: []int{0, 19, 20, 31, 32}
}