How to validate a comma-separated string in Golang?

I'm trying to check that a string is comma-separated "correctly", meaning that
validString := "a, b, c, d, e"
is a valid comma-separated string, as it contains single commas.
Here's an invalid string, as it contains multiple commas and a semicolon:
invalidString1 := "a,, b, c,,,,d, e,,; f, g"
Here's another invalid string, as it contains an empty element (", ,") and has no comma between a and b:
invalidString2 := "a b, , c, d"
My first idea was to use the "regexp" package to check that the string is valid, using regex patterns discussed elsewhere: Regex for Comma delimited list
package main
import (
"fmt"
"regexp"
)
func main() {
r := regexp.MustCompile(`(\d+)(,\s*\d+)*`)
fmt.Println(r)
}
However, I don't understand how we would use this to "validate" strings...that is, either classify the string as valid or invalid based on these regex patterns.

Once you have the regular expression compiled you can use Match (or MatchString) to check if there is a match e.g. (using a slightly modified regular expression so your examples work):
package main
import (
"fmt"
"regexp"
)
func main() {
r := regexp.MustCompile(`^(\w+)(,\s*\w+)*$`)
fmt.Println(r.MatchString("a, b, c, d, e"))
fmt.Println(r.MatchString("a,, b, c,,,,d, e,,; f, g"))
fmt.Println(r.MatchString("a b, , c, d"))
}
Try it in the go playground.
There are plenty of other ways of checking for a valid comma-separated list; which is best depends upon your use case. If you are loading the data into a slice (or other data structure) then it's often simplest to just check the values as you process them into the structure.

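If performance is a key factor, you can strip all whitespace from the string and then apply the following algorithm:
func validate(str string) bool {
	// A valid input alternates single characters (even positions) and
	// commas (odd positions), so its length must be odd, e.g. "a,b,c".
	if len(str)%2 == 0 {
		return false
	}
	for i := 0; i < len(str); i++ {
		// Odd positions must hold a comma; even positions must not.
		if (i%2 != 0) != (str[i] == ',') {
			return false
		}
	}
	return true
}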
Here is the benchmark:
BenchmarkValidate-8 362536340 3.27 ns/op 0 B/op 0 allocs/op
BenchmarkValidateRegex-8 13636486 87.4 ns/op 0 B/op 0 allocs/op
Note:
This procedure only works when every element is a single character and all whitespace has been removed beforehand, because it relies on the input being a strict "CHARACTER-COMMA-CHARACTER-COMMA" sequence.
Code for the benchmark:
func BenchmarkValidate(b *testing.B) {
invalidString1 := "a,, b, c,,,,d, e,,; f, g"
invalidString1 = strings.ReplaceAll(invalidString1, " ", "")
for x := 0; x < b.N; x++ {
validate(invalidString1)
}
}
func BenchmarkValidateRegex(b *testing.B) {
r := regexp.MustCompile(`^(\w+)(,\s*\w+)*$`)
invalidString1 := "a,, b, c,,,,d, e,,; f, g"
for x := 0; x < b.N; x++ {
r.MatchString(invalidString1)
}
}

Related

How to extract different variables from a regex without compiling each expression

I have a struct representing sizes of computer objects. Objects of this struct are constructed from string values input by users; e.g. "50KB" would be tokenised into an int value of "50" and the string value "KB".
type SizeUnit string
const (
B = "B"
KB = "KB"
MB = "MB"
GB = "GB"
TB = "TB"
)
type ObjectSize struct {
NumberOfUnits int
Unit SizeUnit
}
func NewObjectSizeFromString(input_str string) (*ObjectSize, error)
In the body of this function, I first check if the input value is in the valid format; i.e. any number of digits, followed by any one of "B", "KB", "MB", "GB" or "TB". I then extract the int and string components separately and return a pointer to a struct.
In order to do these three things though, I'm having to compile the regex three times.
The first time to check the format of the input string
rg, err := regexp.Compile(`^[0-9]+B$|KB$|MB$|GB$|TB$`)
And then compile again to fetch the int component:
rg, err := regexp.Compile(`^[0-9]+`)
rg.FindString(input_str)
And then compile again to fetch the string/units component:
rg, err := regexp.Compile(`B$|KB$|MB$|GB$|TB$`)
rg.FindString(input_str)
Is there any way to get the two components from the input string with a single regex compilation?
The full code can be found on the Go Playground.
I should point out that this is an academic question as I'm experimenting with Go's regex library. For a simple use-case of this sort, I would probably use a simple for loop to parse the input string.
You can capture both the values with a single expression using regexp.FindStringSubmatch:
func NewObjectSizeFromString(input_str string) (*ObjectSize, error) {
var defaultReturn *ObjectSize = nil
full_search_pattern := `^([0-9]+)([KMGT]?B)$`
rg, err := regexp.Compile(full_search_pattern)
if err != nil {
return defaultReturn, errors.New("Could not compile search expression")
}
matched := rg.FindStringSubmatch(input_str)
if matched == nil {
return defaultReturn, errors.New("Not in valid format")
}
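i, err := strconv.ParseInt(matched[1], 10, 32)
if err != nil {
// The digits matched the pattern but do not fit in 32 bits.
return defaultReturn, err
}
return &ObjectSize{int(i), SizeUnit(matched[2])}, nil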
}
See the playground.
The ^([0-9]+)([KMGT]?B)$ regex matches
^ - start of string
([0-9]+) - Group 1 (this value will be held in matched[1]): one or more digits
([KMGT]?B) - Group 2 (it will be in matched[2]): an optional K, M, G, T letter, and then a B letter
$ - end of string.
Note that matched[0] will hold the whole match.

Find all strings in between two strings in Go

I am working on extracting multiple matches between two strings.
In the example below, I am trying to regex out an A B C substring out of my string.
Here is my code:
package main
import (
"fmt"
"regexp"
)
func main() {
str:= "Movies: A B C Food: 1 2 3"
re := regexp.MustCompile(`[Movies:][^Food:]*`)
match := re.FindAllString(str, -1)
fmt.Println(match)
}
I am clearly doing something wrong in my regex. I am trying to get the A B C string between Movies: and Food:.
What is the proper regex to get all strings between two strings?
In Go, since its RE2-based regexp does not support lookarounds, you need to use a capturing group together with the regexp.FindAllStringSubmatch function:
left := "LEFT_DELIMITER_TEXT_HERE"
right := "RIGHT_DELIMITER_TEXT_HERE"
rx := regexp.MustCompile(`(?s)` + regexp.QuoteMeta(left) + `(.*?)` + regexp.QuoteMeta(right))
matches := rx.FindAllStringSubmatch(str, -1)
Note the use of regexp.QuoteMeta that automatically escapes all special regex metacharacters in the left- and right-hand delimiters.
The (?s) makes . match across lines and (.*?) captures everything between the left and right delimiters into Group 1.
So, here you can use
package main
import (
"fmt"
"regexp"
)
func main() {
str:= "Movies: A B C Food: 1 2 3"
r := regexp.MustCompile(`Movies:\s*(.*?)\s*Food`)
matches := r.FindAllStringSubmatch(str, -1)
for _, v := range matches {
fmt.Println(v[1])
}
}
See the Go demo. Output: A B C.

Why this regular expression in Golang is non greedy? [duplicate]

Here is a simple regular expression:
package main
import (
"fmt"
"regexp"
)
const data = "abcdefghijklmn"
func main() {
r, err := regexp.Compile(".{1,6}")
if err != nil {
panic(err)
}
for _, d := range r.FindAllIndex([]byte(data), -1) {
fmt.Println(data[d[0]:d[1]])
}
}
And we know it is greedy:
abcdef
ghijkl
mn
Now, we can add a ? after the expression to make it non greedy:
package main
import (
"fmt"
"regexp"
)
const data = "abcdefghijklmn"
func main() {
r, err := regexp.Compile(".{1,6}?")
if err != nil {
panic(err)
}
for _, d := range r.FindAllIndex([]byte(data), -1) {
fmt.Println(data[d[0]:d[1]])
}
}
And we can get:
a
b
c
d
e
f
g
h
i
j
k
l
m
n
However, if we add other chars after the expression, it becomes greedy:
package main
import (
"fmt"
"regexp"
)
const data = "abcdefghijklmn"
func main() {
r, err := regexp.Compile(".{1,6}?k")
if err != nil {
panic(err)
}
for _, d := range r.FindAllIndex([]byte(data), -1) {
fmt.Println(data[d[0]:d[1]])
}
}
And we get:
efghijk
So why does it become greedy if we add a char after it?
Adding a lazy quantifier after a repetition count changes it from matching as many as possible, to as few as possible.
However, this does not change the fact that the string must be processed serially. This is where your two cases differ:
.{1,6}? returns one character at a time because a single character is the shortest possible match at each position. The lazy quantifier lets the engine report a match after a single character, without needing to keep processing the string.
.{1,6}?k has to skip over abcd to get a match, because no match can start there, but it then finds that the substring starting at e matches. A lazy quantifier does not make the engine abandon a match at the current position in favor of a shorter one that starts later.
In short: matching from the current position takes precedence over moving to the next position in the hope of a smaller match.
As for your question about making it lazy again, you can't. You'll have to find a different regular expression for the output you want.

How to match all overlapping pattern

I want to get the indexes of the following pattern (\.\.#\.\.) in the following string :
...#...#....#.....#..#..#..#.......
But Go's regexp package does not return overlapping matches.
Thus I got : [[1 6 1 6] [10 15 10 15] [16 21 16 21] [22 27 22 27]]
As one can see, the second # is both preceded and followed by two dots, but that match is not returned by the method FindAllStringSubmatchIndex.
I tried to use different methods from regexp without success. Searching the documentation, I found nothing useful on https://golang.org/pkg/regexp and https://golang.org/src/regexp/regexp.go
On the contrary, it seems regexp does not natively support this functionality :
// If 'All' is present, the routine matches successive non-overlapping matches of the entire expression.
I can solve the issue another way, but since I am doing this exercise to learn Golang, I want to know whether regexp can do it. Thanks :)
Here is my code for reference :
matches := r.pattern.FindAllStringSubmatchIndex(startingState, -1)
fmt.Println(r.pattern)
fmt.Println(matches)
for _, m := range matches {
tempState = tempState[:m[0]+2] + "#" + tempState[m[0]+3:]
fmt.Println(tempState)
}
There's no reason to use a regex for this. Regex is overkill for such a simple task: it's overly complex and less efficient. Instead you should just use strings.Index and a for loop:
input := "...#...#....#.....#..#..#..#......."
idx := []int{}
j := 0
for {
i := strings.Index(input[j:], "..#..")
if i == -1 {
break
}
fmt.Println(j)
idx = append(idx, j+i)
j += i+1
}
fmt.Println("Indexes:", idx)
Playground link
Go is for programmers. For example,
package main
import (
"fmt"
"strings"
)
func findIndices(haystack, needle string) []int {
var x []int
for i := 0; i <= len(haystack)-len(needle); i++ {
j := strings.Index(haystack[i:], needle)
if j < 0 {
break
}
i += j
x = append(x, i)
}
return x
}
func main() {
haystack := `...#...#....#.....#..#..#..#.......`
needle := `..#..`
fmt.Println(findIndices(haystack, needle))
}
Playground: https://play.golang.org/p/nNE5IB1feQT
Output:
[1 5 10 16 19 22 25]
Regular Expression References:
Regular Expression Matching Can Be Simple And Fast
Implementing Regular Expressions
Package [regexp/]syntax

Split string using regular expression in Go

I'm trying to find a good way to split a string using a regular expression instead of a string. Thanks
http://nsf.github.io/go/strings.html?f:Split!
You can use regexp.Split to split a string into a slice of strings with the regex pattern as the delimiter.
package main
import (
"fmt"
"regexp"
)
func main() {
re := regexp.MustCompile("[0-9]+")
txt := "Have9834a908123great10891819081day!"
split := re.Split(txt, -1)
set := []string{}
for i := range split {
set = append(set, split[i])
}
fmt.Println(set) // prints [Have a great day!]
}
I made a regex-split function based on the behavior of the regex split functions in Java, C#, PHP, and others. It returns only a slice of strings, without the index information.
func RegSplit(text string, delimeter string) []string {
reg := regexp.MustCompile(delimeter)
indexes := reg.FindAllStringIndex(text, -1)
laststart := 0
result := make([]string, len(indexes) + 1)
for i, element := range indexes {
result[i] = text[laststart:element[0]]
laststart = element[1]
}
result[len(indexes)] = text[laststart:len(text)]
return result
}
example:
fmt.Println(RegSplit("a1b22c333d", "[0-9]+"))
result:
[a b c d]
If you just want to split on certain characters, you can use strings.FieldsFunc, otherwise I'd go with regexp.FindAllString.
The regexp.Split() function would be the best way to do this.
You should be able to create your own split function that loops over the results of RegExp.FindAllString, placing the intervening substrings into a new array.
http://nsf.github.com/go/regexp.html?m:Regexp.FindAllString!
I found this old post while looking for an answer. I'm new to Go but these answers seem overly complex for the current version of Go. The simple function below returns the same result as those above.
package main
import (
"fmt"
"regexp"
)
func goReSplit(text string, pattern string) []string {
regex := regexp.MustCompile(pattern)
result := regex.Split(text, -1)
return result
}
func main() {
fmt.Printf("%#v\n", goReSplit("Have9834a908123great10891819081day!", "[0-9]+"))
}