unsupported Perl syntax: `(?<` - regex

I want to parse the result of the command 'gpg --list-keys' to display it in the browser.
The command output looks like this:
pub rsa3072 2021-08-03 [SC] [expires: 2023-08-03]
07C47E284765D5593171C18F00B11D51A071CB55
uid [ultimate] user1 <user1#example.com>
sub rsa3072 2021-08-03 [E] [expires: 2023-08-03]
pub rsa3072 2021-08-04 [SC]
37709ABD4D96324AB8CBFC3B441812AFBCE7A013
uid [ultimate] user2 <user2#example.com>
sub rsa3072 2021-08-04 [E]
I expect something like this:
{
{uid : user1#example.com},
{uid : user2#example.com},
}
Here is the code:
type GPGList struct {
    uid string
}

// Findlistkeys runs gpg --list-keys and parses its output.
func Findlistkeys() {
    pathexec, _ := exec.LookPath("gpg")
    cmd := exec.Command(pathexec, "--list-keys")
    cmdOutput := &bytes.Buffer{}
    cmd.Stdout = cmdOutput
    printCommand(cmd)
    err := cmd.Run()
    printError(err)
    output := cmdOutput.Bytes()
    printOutput(output)
    GPG := GPGList{}
    parseOutput(output, &GPG)
    fmt.Println(GPG)
}

func printCommand(cmd *exec.Cmd) {
    fmt.Printf("==> Executing: %s\n", strings.Join(cmd.Args, " "))
}

func printError(err error) {
    if err != nil {
        os.Stderr.WriteString(fmt.Sprintf("==> Error: %s\n", err.Error()))
    }
}

func printOutput(outs []byte) {
    if len(outs) > 0 {
        fmt.Printf("==> Output: %s\n", string(outs))
    }
}

func parseOutput(outs []byte, GPG *GPGList) {
    var uid = regexp.MustCompile(`(?<=\<)(.*?)(?=\>)`)
    fmt.Println(uid)
}
It ends with the following message:
panic: regexp: Compile(`(?<=\<)(.*?)(?=\>)`): error parsing regexp: invalid or unsupported Perl syntax: `(?<
So far I'm stuck on the regex.
I don't understand why it won't compile...
What is wrong with it?
I've tested the regex on an online simulator and it looked OK, yet there is something wrong with it.
Any suggestion please?

The regexp package uses the syntax accepted by RE2. From https://github.com/google/re2/wiki/Syntax
(?<=re) after text matching re (NOT SUPPORTED)
Hence the error message:
error parsing regexp: invalid or unsupported Perl syntax: (?<
The online simulator is likely testing a different regular expression syntax. You will need to find an alternative regular expression encoding or a different regular expression package.
An alternative encoding you can try is \<([^\>]*)\> (playground). This is quite simple and may not match your original intent.
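For instance, a minimal sketch of that alternative applied to one uid line from the question (the sample line is taken from the output above; FindAllStringSubmatch returns the capture group without the angle brackets):
package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Capture everything between < and > in a group instead of using lookarounds.
    re := regexp.MustCompile(`\<([^\>]*)\>`)
    line := "uid [ultimate] user1 <user1#example.com>"
    for _, m := range re.FindAllStringSubmatch(line, -1) {
        fmt.Println(m[1]) // prints: user1#example.com
    }
}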

Here is another solution, based on the machine-readable output of gpg --list-keys --with-colons.
It is still a slow solution, but it is easy to write, easy to update, and does not use regular expressions.
A clever person could come up with an even faster solution without adding a crazy wall of complexity (just loop over the string until <, then capture the string until >).
It is based on a simple CSV reader, so you can plug it onto the output stream of an exec.Cmd instance, or whatever else; a sketch of that wiring follows the code below.
The big advantage is that it does not need to buffer the whole data in memory, it can decode the stream as it goes.
package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "regexp"
    "strings"
)

func main() {
    fmt.Printf("%#v\n", extractEmailsCSV(csvInput))
}

var uid = regexp.MustCompile(`\<(.*?)\>`)

func extractEmailsRegexp(input string) (out []string) {
    submatchall := uid.FindAllString(input, -1)
    for _, element := range submatchall {
        element = strings.Trim(element, "<")
        element = strings.Trim(element, ">")
        out = append(out, element)
    }
    return
}

func extractEmailsCSV(input string) (out []string) {
    r := strings.NewReader(input)
    cr := csv.NewReader(r)
    cr.Comma = ':'
    cr.ReuseRecord = true
    cr.FieldsPerRecord = -1
    for {
        records, err := cr.Read()
        if err == io.EOF {
            break
        } else if err != nil {
            panic(err)
        }
        if len(records) < 10 {
            continue
        }
        field := records[9]
        if strings.Contains(field, "#") {
            begin := strings.Index(field, "<")
            end := strings.Index(field, ">")
            if begin >= 0 && end > begin {
                out = append(out, field[begin+1:end])
            }
        }
    }
    return
}

var regexpInput = `
pub rsa3072 2021-08-03 [SC] [expires: 2023-08-03]
07C47E284765D5593171C18F00B11D51A071CB55
uid [ultimate] user1 <user1#example.com>
sub rsa3072 2021-08-03 [E] [expires: 2023-08-03]
pub rsa3072 2021-08-04 [SC]
37709ABD4D96324AB8CBFC3B441812AFBCE7A013
uid [ultimate] user2 <user2#example.com>
sub rsa3072 2021-08-04 [E]
`

var csvInput = `pub:u:1024:17:51FF9A17136C5B87:1999-04-24::59:-:Tony Nelson <tnelson#techie.com>:
uid:u::::::::Tony Nelson <tnelson#conceptech.com>:
`
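As noted above, the decoder can also be plugged directly onto the command's stdout so nothing is buffered in memory. A rough sketch of that wiring (the --with-colons flag comes from this answer; the function name and the extra "os/exec" import are assumptions, and the parsing mirrors extractEmailsCSV above):
// Sketch only: stream gpg's machine-readable output straight into the CSV decoder.
func extractEmailsFromGPG() ([]string, error) {
    cmd := exec.Command("gpg", "--list-keys", "--with-colons")
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return nil, err
    }
    if err := cmd.Start(); err != nil {
        return nil, err
    }
    cr := csv.NewReader(stdout)
    cr.Comma = ':'
    cr.ReuseRecord = true
    cr.FieldsPerRecord = -1
    var out []string
    for {
        record, err := cr.Read()
        if err == io.EOF {
            break
        } else if err != nil {
            return nil, err
        }
        if len(record) < 10 {
            continue
        }
        field := record[9] // uid field in --with-colons output
        begin, end := strings.Index(field, "<"), strings.Index(field, ">")
        if begin >= 0 && end > begin {
            out = append(out, field[begin+1:end])
        }
    }
    return out, cmd.Wait()
}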
We don't have exactly the same benchmark setup, but anyway; if you think it skews the comparison, feel free to provide a better bench setup.
Here is the benchmark setup:
package main

import (
    "strings"
    "testing"
)

func BenchmarkCSV_1(b *testing.B) {
    input := strings.Repeat(csvInput, 1)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = extractEmailsCSV(input)
    }
}

func BenchmarkRegExp_1(b *testing.B) {
    input := strings.Repeat(regexpInput, 1)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = extractEmailsRegexp(input)
    }
}

func BenchmarkCSV_10(b *testing.B) {
    input := strings.Repeat(csvInput, 10)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = extractEmailsCSV(input)
    }
}

func BenchmarkRegExp_10(b *testing.B) {
    input := strings.Repeat(regexpInput, 10)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = extractEmailsRegexp(input)
    }
}

func BenchmarkCSV_100(b *testing.B) {
    input := strings.Repeat(csvInput, 100)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = extractEmailsCSV(input)
    }
}

func BenchmarkRegExp_100(b *testing.B) {
    input := strings.Repeat(regexpInput, 100)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = extractEmailsRegexp(input)
    }
}
And here is the result
BenchmarkCSV_1
BenchmarkCSV_1-4 242736 4200 ns/op 5072 B/op 18 allocs/op
BenchmarkRegExp_1
BenchmarkRegExp_1-4 252232 4466 ns/op 400 B/op 9 allocs/op
BenchmarkCSV_10
BenchmarkCSV_10-4 68257 17335 ns/op 7184 B/op 40 allocs/op
BenchmarkRegExp_10
BenchmarkRegExp_10-4 29871 39947 ns/op 3414 B/op 68 allocs/op
BenchmarkCSV_100
BenchmarkCSV_100-4 7538 141609 ns/op 25872 B/op 223 allocs/op
BenchmarkRegExp_100
BenchmarkRegExp_100-4 1726 674718 ns/op 37858 B/op 615 allocs/op
In terms of allocations the regular expression does better on the smallest dataset (raw speed is comparable), but as soon as there is a bit more data the regular expression is slower and allocates more by a significant factor.
Read also https://pkg.go.dev/testing
My conclusion is: don't use regular expressions here. Also, optimizing a regexp is hard if not impossible, whereas optimizing an algorithm that parses some text input is doable, if not easy.
To summarize, even the fastest and best runtime is nothing without a thoughtful programmer to drive it.
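For reference, a minimal sketch of the "just loop until < and capture until >" idea mentioned at the top of this answer (not benchmarked here, and it assumes the addresses never contain nested brackets):
// Scan for <...> pairs directly, without regexp or csv.
func extractEmailsScan(input string) (out []string) {
    for {
        begin := strings.Index(input, "<")
        if begin < 0 {
            return
        }
        end := strings.Index(input[begin+1:], ">")
        if end < 0 {
            return
        }
        out = append(out, input[begin+1:begin+1+end])
        input = input[begin+end+2:] // continue after the closing >
    }
}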

So I updated the regex... but since (?<=\<)(.*?)(?=\>) was working on the online simulator, I was really surprised.
Why can't regexes work the same in all languages...
func parseOutput(outs []byte, GPG *GPGList) {
    var uid = regexp.MustCompile(`\<(.*?)\>`)
    submatchall := uid.FindAllString(string(outs), -1)
    for _, element := range submatchall {
        element = strings.Trim(element, "<")
        element = strings.Trim(element, ">")
        fmt.Println(element)
    }
}
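Since the pattern already has a capture group, FindAllSubmatch can return the part between the brackets directly, which makes the two Trim calls unnecessary. A small variant as a sketch (collecting into a slice here is an assumption; the question's GPGList holds a single uid):
func parseUIDs(outs []byte) []string {
    uid := regexp.MustCompile(`\<(.*?)\>`)
    var uids []string
    for _, m := range uid.FindAllSubmatch(outs, -1) {
        uids = append(uids, string(m[1])) // m[1] is the capture group, without < and >
    }
    return uids
}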

Related

Issues with regex. nil slice and FindStringSubmatch

I'm trying to regex a pattern:
random-text (800)
I'm doing something like this:
func main() {
    rando := "random-text (800)"
    parsedThing := regexp.MustCompile(`\((.*?)\)`)
    match := parsedThing.FindStringSubmatch(rando)
    if match[1] == "" {
        fmt.Println("do a thing")
    }
    if match[1] != "" {
        fmt.Println("do a thing")
    }
}
I only want to capture what's inside the parentheses, but FindString includes the (). I've also tried FindStringSubmatch, which is great because I can pick the capture group out of the slice, but then I get an error in my unit test because the slice is nil. I need to test for an empty string, as that's something that could happen. Is there a better regex I can use that will only capture what is inside the parentheses? Or is there a better way to handle a nil slice?
I usually compare against nil, based on the documentation:
A return value of nil indicates no match.
package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`\((.+)\)`)
    find := re.FindStringSubmatch("random-text (800)")
    if find != nil {
        fmt.Println(find[1] == "800")
    }
}
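For the unit-test side of the question, a hypothetical test (the names and expected values are assumptions, and it assumes the regexp and testing imports) that covers both the matched and the unmatched input; the nil check is what keeps the unmatched case from panicking:
func TestFindNumberInParens(t *testing.T) {
    re := regexp.MustCompile(`\((.+)\)`)
    cases := map[string]string{
        "random-text (800)": "800",
        "random-text":       "", // no parentheses: expect the empty string
    }
    for input, want := range cases {
        got := ""
        if m := re.FindStringSubmatch(input); m != nil {
            got = m[1]
        }
        if got != want {
            t.Errorf("input %q: got %q, want %q", input, got, want)
        }
    }
}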

How to validate a comma-separated string in Golang?

I'm trying to check that a string is comma-separated "correctly", meaning that
validString := "a, b, c, d, e"
is a valid comma-separated string, as it contains single commas.
Here's an invalid string, as it contains multiple commas and a semicolon:
invalidString1 := "a,, b, c,,,,d, e,,; f, g"
Here's another invalid string, as it contains an empty element (", ,") and a place where there is no comma between a and b:
invalidString2 := "a b, , c, d"
My first idea was to use the "regexp" package to check that the string is valid, using regex patterns discussed elsewhere: Regex for Comma delimited list
package main

import (
    "fmt"
    "regexp"
)

func main() {
    r := regexp.MustCompile(`(\d+)(,\s*\d+)*`)
    fmt.Println(r)
}
However, I don't understand how we would use this to "validate" strings...that is, either classify the string as valid or invalid based on these regex patterns.
Once you have the regular expression compiled, you can use Match (or MatchString) to check whether there is a match, e.g. (using a slightly modified regular expression so your examples work):
package main

import (
    "fmt"
    "regexp"
)

func main() {
    r := regexp.MustCompile(`^(\w+)(,\s*\w+)*$`)
    fmt.Println(r.MatchString("a, b, c, d, e"))
    fmt.Println(r.MatchString("a,, b, c,,,,d, e,,; f, g"))
    fmt.Println(r.MatchString("a b, , c, d"))
}
Try it in the go playground.
There are plenty of other ways of checking for a valid comma-separated list; which is best depends upon your use case. If you are loading the data into a slice (or other data structure) then it's often simplest to just check the values as you process them into the structure.
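For example, a sketch of that "validate while loading" approach (the exact rules are an assumption: every element must be non-empty and must not contain spaces, commas or semicolons):
func parseList(s string) ([]string, error) {
    parts := strings.Split(s, ",")
    out := make([]string, 0, len(parts))
    for _, p := range parts {
        p = strings.TrimSpace(p)
        if p == "" || strings.ContainsAny(p, ";, ") {
            return nil, fmt.Errorf("invalid element %q", p)
        }
        out = append(out, p)
    }
    return out, nil
}
With the strings from the question, parseList accepts "a, b, c, d, e" and returns an error for both invalid examples.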
In case performance is a key factor, you can simply remove all whitespace and proceed with the following algorithm:
func validate(str string) bool {
    for i, c := range str {
        if i%2 != 0 {
            if c != ',' {
                return false
            }
        }
    }
    return true
}
Here is the benchmark:
BenchmarkValidate-8 362536340 3.27 ns/op 0 B/op 0 allocs/op
BenchmarkValidateRegex-8 13636486 87.4 ns/op 0 B/op 0 allocs/op
Note:
The procedure works only if each item is a single character and the spaces have been removed, because it relies on the fact that we are validating a sequence of "CHARACTER-SYMBOL-CHARACTER-SYMBOL".
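A quick check against the strings from the question (whitespace stripped first, as described above; validate is the function from this answer):
package main

import (
    "fmt"
    "strings"
)

func main() {
    inputs := []string{"a, b, c, d, e", "a,, b, c,,,,d, e,,; f, g", "a b, , c, d"}
    for _, s := range inputs {
        fmt.Println(validate(strings.ReplaceAll(s, " ", "")))
    }
    // Prints true, then false, then false.
}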
Code for the benchmark
func BenchmarkValidate(b *testing.B) {
    invalidString1 := "a,, b, c,,,,d, e,,; f, g"
    invalidString1 = strings.ReplaceAll(invalidString1, " ", "")
    for x := 0; x < b.N; x++ {
        validate(invalidString1)
    }
}

func BenchmarkValidateRegex(b *testing.B) {
    r := regexp.MustCompile(`^(\w+)(,\s*\w+)*$`)
    invalidString1 := "a,, b, c,,,,d, e,,; f, g"
    for x := 0; x < b.N; x++ {
        r.MatchString(invalidString1)
    }
}

Selecting text between borders using regexp in Go [duplicate]

content := `{null,"Age":24,"Balance":33.23}`
rule, _ := regexp.Compile(`"([^\"]+)"`)
results := rule.FindAllString(content, -1)
fmt.Println(results[0]) // "Age"
fmt.Println(results[1]) // "Balance"
There is a JSON string with a null value that looks like the one above.
This JSON comes from a web API and I don't want to replace anything inside it.
I want to use a regex to match all the keys in this JSON without the double quotes, so that the output is Age and Balance rather than "Age" and "Balance".
How can I achieve this?
One solution would be to use a regular expression that matches any character between quotes (such as your example or ".*?") and either put a matching group (aka "submatch") inside the quotes or return the relevant substring of the match, using regexp.FindAllStringSubmatch(...) or regexp.FindAllString(...), respectively.
For example (Go Playground):
func main() {
    str := `{null,"Age":24,"Balance":33.23}`
    fmt.Printf("OK1: %#v\n", getQuotedStrings1(str))
    // OK1: []string{"Age", "Balance"}
    fmt.Printf("OK2: %#v\n", getQuotedStrings2(str))
    // OK2: []string{"Age", "Balance"}
}

var re1 = regexp.MustCompile(`"(.*?)"`) // Note the matching group (submatch).

func getQuotedStrings1(s string) []string {
    ms := re1.FindAllStringSubmatch(s, -1)
    ss := make([]string, len(ms))
    for i, m := range ms {
        ss[i] = m[1]
    }
    return ss
}

var re2 = regexp.MustCompile(`".*?"`)

func getQuotedStrings2(s string) []string {
    ms := re2.FindAllString(s, -1)
    ss := make([]string, len(ms))
    for i, m := range ms {
        ss[i] = m[1 : len(m)-1] // Note the substring of the match.
    }
    return ss
}
Note that the second version (without a submatching group) may be slightly faster based on a simple benchmark, if performance is critical.
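A sketch of such a benchmark (assuming both helpers above sit in the same package as this _test file):
func BenchmarkGetQuotedStrings1(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = getQuotedStrings1(`{null,"Age":24,"Balance":33.23}`)
    }
}

func BenchmarkGetQuotedStrings2(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = getQuotedStrings2(`{null,"Age":24,"Balance":33.23}`)
    }
}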

How to match all overlapping pattern

I want to get the indexes of the pattern (\.\.#\.\.) in the following string:
...#...#....#.....#..#..#..#.......
But Golang does not manage overlapping matches.
Thus I got: [[1 6 1 6] [10 15 10 15] [16 21 16 21] [22 27 22 27]]
As one can see, two dots also precede and follow the second #, but that match is not returned by the method FindAllStringSubmatchIndex.
I tried different methods from regexp without success. Searching the documentation, I found nothing useful on https://golang.org/pkg/regexp and https://golang.org/src/regexp/regexp.go
On the contrary, it seems regexp does not natively support this functionality:
// If 'All' is present, the routine matches successive non-overlapping matches of the entire expression.
I can work around the issue, but since I am doing this exercise to learn Golang, I want to know. Thanks :)
Here is my code for reference:
matches := r.pattern.FindAllStringSubmatchIndex(startingState)
fmt.Println(r.pattern)
fmt.Println(matches)
for _, m := range matches {
    tempState = tempState[:m[0]+2] + "#" + tempState[m[0]+3:]
    fmt.Println(tempState)
}
There's no reason to use a regex for this. A regex is overkill for such a simple task: it's overly complex and less efficient. Instead you should just use strings.Index and a for loop:
input := "...#...#....#.....#..#..#..#......."
idx := []int{}
j := 0
for {
    i := strings.Index(input[j:], "..#..")
    if i == -1 {
        break
    }
    fmt.Println(j)
    idx = append(idx, j+i)
    j += i + 1
}
fmt.Println("Indexes:", idx)
Playground link
Go is for programmers. For example,
package main

import (
    "fmt"
    "strings"
)

func findIndices(haystack, needle string) []int {
    var x []int
    for i := 0; i < len(haystack)-len(needle); i++ {
        j := strings.Index(haystack[i:], needle)
        if j < 0 {
            break
        }
        i += j
        x = append(x, i)
    }
    return x
}

func main() {
    haystack := `...#...#....#.....#..#..#..#.......`
    needle := `..#..`
    fmt.Println(findIndices(haystack, needle))
}
Playground: https://play.golang.org/p/nNE5IB1feQT
Output:
[1 5 10 16 19 22 25]
Regular Expression References:
Regular Expression Matching Can Be Simple And Fast
Implementing Regular Expressions
Package [regexp/]syntax

Golang regular expression for parsing key value pair into a string map

I'm looking to parse the following string into a map[string]string using a regular expression:
time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10
I'm trying to create a map that would have
m["time"] = "2017-05-30T19:02:08-05:00"
m["level"] = "info"
etc
I have tried using regex.FindAllStringIndex but can't quite come up with an appropriate regex. Is this the correct way to go?
This does not use a regex; it is just an example of how to achieve the same result using strings.FieldsFunc.
https://play.golang.org/p/rr6U8xTJZT
package main

import (
    "fmt"
    "strings"
    "unicode"
)

const foo = `time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10`

func main() {
    lastQuote := rune(0)
    f := func(c rune) bool {
        switch {
        case c == lastQuote:
            lastQuote = rune(0)
            return false
        case lastQuote != rune(0):
            return false
        case unicode.In(c, unicode.Quotation_Mark):
            lastQuote = c
            return false
        default:
            return unicode.IsSpace(c)
        }
    }
    // splitting string by space but considering quoted section
    items := strings.FieldsFunc(foo, f)
    // create and fill the map
    m := make(map[string]string)
    for _, item := range items {
        x := strings.Split(item, "=")
        m[x[0]] = x[1]
    }
    // print the map
    for k, v := range m {
        fmt.Printf("%s: %s\n", k, v)
    }
}
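Note that with the loop above the quoted values keep their surrounding quotes in the map (e.g. msg maps to "some log message" including the quotes), and a value that itself contains "=" would be cut short by Split. A possible tweak, as a sketch (not part of the original answer):
// SplitN keeps values that themselves contain "=", and Trim drops the surrounding quotes.
func fieldsToMap(items []string) map[string]string {
    m := make(map[string]string, len(items))
    for _, item := range items {
        kv := strings.SplitN(item, "=", 2)
        if len(kv) != 2 {
            continue
        }
        m[kv[0]] = strings.Trim(kv[1], `"`)
    }
    return m
}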
Instead of writing a regex of your own, you could simply use the github.com/kr/logfmt package.
The package implements decoding of logfmt key-value pairs.
Example logfmt message:
foo=bar a=14 baz="hello kitty" cool%story=bro f %^asdf
Example result in JSON:
{
    "foo": "bar",
    "a": 14,
    "baz": "hello kitty",
    "cool%story": "bro",
    "f": true,
    "%^asdf": true
}
Use named capturing groups in your regular expression and the FindStringSubmatch and SubexpNames functions. E.g.:
s := `time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10`
re := regexp.MustCompile(`time="(?P<time>.*?)"\slevel=(?P<level>.*?)\s`)
values := re.FindStringSubmatch(s)
keys := re.SubexpNames()
// create map
d := make(map[string]string)
for i := 1; i < len(keys); i++ {
    d[keys[i]] = values[i]
}
fmt.Println(d)
// OUTPUT: map[time:2017-05-30T19:02:08-05:00 level:info]
values is a list containing all submatches. The first submatch is the whole expression that matches the regexp, followed by a submatch for each capturing group.
You can wrap the code in a function if you need this more frequently (i.e. if you need something like Python's match.groupdict):
package main

import (
    "fmt"
    "regexp"
)

func groupmap(s string, r *regexp.Regexp) map[string]string {
    values := r.FindStringSubmatch(s)
    keys := r.SubexpNames()
    // create map
    d := make(map[string]string)
    for i := 1; i < len(keys); i++ {
        d[keys[i]] = values[i]
    }
    return d
}

func main() {
    s := `time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10`
    re := regexp.MustCompile(`time="(?P<time>.*?)"\slevel=(?P<level>.*?)\s`)
    fmt.Println(groupmap(s, re))
    // OUTPUT: map[time:2017-05-30T19:02:08-05:00 level:info]
}
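One caveat, the same one as in the nil-slice question further up: FindStringSubmatch returns nil when nothing matches, so groupmap would panic on values[i] for a non-matching input. A defensive variant, as a sketch:
// Return nil and false instead of panicking when the expression does not match.
func groupmapSafe(s string, r *regexp.Regexp) (map[string]string, bool) {
    values := r.FindStringSubmatch(s)
    if values == nil {
        return nil, false
    }
    keys := r.SubexpNames()
    d := make(map[string]string, len(keys))
    for i := 1; i < len(keys); i++ {
        d[keys[i]] = values[i]
    }
    return d, true
}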