Go: regexp FindAll and ReplaceAll in a single pass - regex

I'm parsing a web page to get some values inside labels, but I'm not interested in the label, only in the content.
I'm using regexp.FindAll to get all the matching expressions (including the label) and then ReplaceAll on each match to strip the label. Running the regexp twice roughly doubles the time, and I'd like to avoid that.
Is there a way to apply both functions in a single pass, or an equivalent regexp?
Of course, I could write a function to remove the label, but in some cases that could be more complex because of variable-length labels (like ), and a regexp can take care of this.
A simple example of my code is here (it won't run in the playground): http://play.golang.org/p/uGKjzmylSY
func main() {
res, err := http.Get("http://www.elpais.es")
if err != nil {
panic(err)
}
body, err := ioutil.ReadAll(res.Body)
res.Body.Close()
// Check the read error before using body.
if err != nil {
panic(err)
}
fmt.Println("body: ", len(body), cap(body))
r := regexp.MustCompile("<li>(.+)</li>")
// Find all subexpressions, containing the label <li>
out := r.FindAll(body, -1)
for i, v := range out[:10] {
fmt.Printf("%d: %s\n", i, v)
}
//Replace to remove the label.
out2 := make([][]byte, len(out))
for i, v := range out {
out2[i] = r.ReplaceAll(v, []byte("$1"))
}
for i, v := range out2[:10] {
fmt.Printf("%d: %s\n", i, v)
}
}
By the way, I understand that regex cannot be used to parse HTML. I'm only interested in some of the innermost labels, not in the structure or nestings, so I suppose it is OK :)

Recommendation: use goquery for that task. It is very simple to use and shortens your code considerably.
Example:
doc, _ := goquery.NewDocument("http://www.elpais.es")
text := doc.Find("li").Slice(10, -1).Text()
Regarding your question, use FindAllSubmatch to extract the match directly:
r := regexp.MustCompile("<li>(.+)</li>")
// Find all subexpressions, containing the label <li>
out := r.FindAllSubmatch(body, -1)
for i, v := range out[:10] {
fmt.Printf("%d: %s\n", i, v[1])
}

Related

How to extract different variables from a regex without compiling each expression

I have a struct representing sizes of computer objects. Objects of this struct are constructed from string values input by users; e.g. "50KB" would be tokenised into an int value of "50" and the string value "KB".
type SizeUnit string
const (
B = "B"
KB = "KB"
MB = "MB"
GB = "GB"
TB = "TB"
)
type ObjectSize struct {
NumberOfUnits int
Unit SizeUnit
}
func NewObjectSizeFromString(input_str string) (*ObjectSize, error)
In the body of this function, I first check if the input value is in the valid format; i.e. any number of digits, followed by any one of "B", "KB", "MB", "GB" or "TB". I then extract the int and string components separately and return a pointer to a struct.
In order to do these three things though, I'm having to compile the regex three times.
The first time is to check the format of the input string:
rg, err := regexp.Compile(`^[0-9]+B$|KB$|MB$|GB$|TB$`)
And then compile again to fetch the int component:
rg, err := regexp.Compile(`^[0-9]+`)
rg.FindString(input_str)
And then compile again to fetch the string/units component:
rg, err := regexp.Compile(`B$|KB$|MB$|GB$|TB$`)
rg.FindString(input_str)
Is there any way to get the two components from the input string with a single regex compilation?
The full code can be found on the Go Playground.
I should point out that this is an academic question as I'm experimenting with Go's regex library. For a simple use-case of this sort, I would probably use a simple for loop to parse the input string.
You can capture both the values with a single expression using regexp.FindStringSubmatch:
func NewObjectSizeFromString(input_str string) (*ObjectSize, error) {
var defaultReturn *ObjectSize = nil
full_search_pattern := `^([0-9]+)([KMGT]?B)$`
rg, err := regexp.Compile(full_search_pattern)
if err != nil {
return defaultReturn, errors.New("Could not compile search expression")
}
matched := rg.FindStringSubmatch(input_str)
if matched == nil {
return defaultReturn, errors.New("Not in valid format")
}
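i, err := strconv.ParseInt(matched[1], 10, 32)
if err != nil {
// ParseInt fails here when the digits do not fit in 32 bits.
return defaultReturn, errors.New("Number of units is out of range")
}
return &ObjectSize{int(i), SizeUnit(matched[2])}, nil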
}
See the playground.
The ^([0-9]+)([KMGT]?B)$ regex matches
^ - start of string
([0-9]+) - Group 1 (this value will be held in matched[1]): one or more digits
([KMGT]?B) - Group 2 (it will be in matched[2]): an optional K, M, G, T letter, and then a B letter
$ - end of string.
Note that matched[0] will hold the whole match.
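Since the question is about avoiding repeated compilation, one possible variation is to compile the pattern once at package level with regexp.MustCompile and reuse it on every call. The sketch below assumes that arrangement; the names sizeRe and parseObjectSize are hypothetical and not part of the original code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

type SizeUnit string

type ObjectSize struct {
	NumberOfUnits int
	Unit          SizeUnit
}

// sizeRe is compiled once when the package is initialised and then shared
// by every call. MustCompile panics at startup if the pattern is invalid.
var sizeRe = regexp.MustCompile(`^([0-9]+)([KMGT]?B)$`)

// parseObjectSize is a hypothetical variant of NewObjectSizeFromString that
// validates the input and extracts both components with a single match.
func parseObjectSize(input string) (*ObjectSize, error) {
	m := sizeRe.FindStringSubmatch(input)
	if m == nil {
		return nil, fmt.Errorf("%q is not in a valid format", input)
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		// Reached when the digits do not fit in an int.
		return nil, fmt.Errorf("invalid number in %q: %w", input, err)
	}
	return &ObjectSize{NumberOfUnits: n, Unit: SizeUnit(m[2])}, nil
}

func main() {
	for _, in := range []string{"50KB", "7B", "12XB", "KB"} {
		size, err := parseObjectSize(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%d %s\n", size.NumberOfUnits, size.Unit)
	}
}
```

A nil result from FindStringSubmatch doubles as the format check, so no separate validation pattern is required.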

Selecting text between borders using regexp in Go [duplicate]

content := `{null,"Age":24,"Balance":33.23}`
rule,_ := regexp.Compile(`"([^\"]+)"`)
results := rule.FindAllString(content,-1)
fmt.Println(results[0]) //"Age"
fmt.Println(results[1]) //"Balance"
The JSON string contains a ``null`` value and looks like the one above.
This JSON comes from a web API and I don't want to replace anything inside it.
I want to use a regex to match all the keys in this JSON without the double quotes, so that the output is ``Age`` and ``Balance`` rather than ``"Age"`` and ``"Balance"``.
How can I achieve this?
One solution would be to use a regular expression that matches any character between quotes (such as your example or ".*?") and either put a matching group (aka "submatch") inside the quotes or return the relevant substring of the match, using regexp.FindAllStringSubmatch(...) or regexp.FindAllString(...), respectively.
For example (Go Playground):
func main() {
str := `{null,"Age":24,"Balance":33.23}`
fmt.Printf("OK1: %#v\n", getQuotedStrings1(str))
// OK1: []string{"Age", "Balance"}
fmt.Printf("OK2: %#v\n", getQuotedStrings2(str))
// OK2: []string{"Age", "Balance"}
}
var re1 = regexp.MustCompile(`"(.*?)"`) // Note the matching group (submatch).
func getQuotedStrings1(s string) []string {
ms := re1.FindAllStringSubmatch(s, -1)
ss := make([]string, len(ms))
for i, m := range ms {
ss[i] = m[1]
}
return ss
}
var re2 = regexp.MustCompile(`".*?"`)
func getQuotedStrings2(s string) []string {
ms := re2.FindAllString(s, -1)
ss := make([]string, len(ms))
for i, m := range ms {
ss[i] = m[1 : len(m)-1] // Note the substring of the match.
}
return ss
}
Note that the second version (without a submatching group) may be slightly faster based on a simple benchmark, if performance is critical.

Regex extracting sets of numbers from string when prefix occurs, while not matching said prefix

As stated in the title, given a situation where I have a string like so:
"somestring~200~122"
I want a regex that matches the numbers only when they are preceded by the prefix "~", so that I ultimately end up with [200, 122].
Matching the prefix is necessary because I need to protect against a case where a string like the one below should not be matched:
"somestring~abc200~def122"
For additional context: I am using Go, so I am planning on doing something like the following to obtain the numbers within the string:
pattern := regexp.MustCompile("regex i need help with")
numbers := pattern.FindAllString(host, -1)
You can use FindAllStringSubmatch to extract the group containing just the digits. Below is an example that finds all instances of ~ followed by numbers. It additionally converts all the matches to ints and inserts them into a slice:
package main
import (
"fmt"
"regexp"
"strconv"
)
func main() {
host := "somestring~200~122"
pattern := regexp.MustCompile(`~(\d+)`)
numberStrings := pattern.FindAllStringSubmatch(host, -1)
numbers := make([]int, len(numberStrings))
for i, numberString := range numberStrings {
number, err := strconv.Atoi(numberString[1])
if err != nil {
panic(err)
}
numbers[i] = number
}
fmt.Println(numbers)
}
https://play.golang.org/p/09YyewtRXz

Golang regular expression for parsing key value pair into a string map

I'm looking to parse the following string into a map[string]string using a regular expression:
time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10
I'm trying to create a map that would have
m["time"] = "2017-05-30T19:02:08-05:00"
m["level"] = "info"
etc
I have tried using regexp.FindAllStringIndex but can't quite come up with an appropriate regex. Is this the correct way to go?
This is not using regex but is just an example of how to achieve the same by using strings.FieldsFunc.
https://play.golang.org/p/rr6U8xTJZT
package main
import (
"fmt"
"strings"
"unicode"
)
const foo = `time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10`
func main() {
lastQuote := rune(0)
f := func(c rune) bool {
switch {
case c == lastQuote:
lastQuote = rune(0)
return false
case lastQuote != rune(0):
return false
case unicode.In(c, unicode.Quotation_Mark):
lastQuote = c
return false
default:
return unicode.IsSpace(c)
}
}
// splitting string by space but considering quoted section
items := strings.FieldsFunc(foo, f)
// create and fill the map
m := make(map[string]string)
for _, item := range items {
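// split on the first "=" only, so values containing "=" stay intact
x := strings.SplitN(item, "=", 2)
if len(x) != 2 {
continue // skip items that are not in key=value form
}
m[x[0]] = x[1]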
}
// print the map
for k, v := range m {
fmt.Printf("%s: %s\n", k, v)
}
}
Instead of writing regex of your own, you could simply use the github.com/kr/logfmt package.
Package implements the decoding of logfmt key-value pairs.
Example logfmt message:
foo=bar a=14 baz="hello kitty" cool%story=bro f %^asdf
Example result in JSON:
{
"foo": "bar",
"a": 14,
"baz": "hello kitty",
"cool%story": "bro",
"f": true,
"%^asdf": true
}
Use named capturing groups in your regular expression and the FindStringSubmatch and SubexpNames functions. E.g.:
s := `time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10`
re := regexp.MustCompile(`time="(?P<time>.*?)"\slevel=(?P<level>.*?)\s`)
values := re.FindStringSubmatch(s)
keys := re.SubexpNames()
// create map
d := make(map[string]string)
for i := 1; i < len(keys); i++ {
d[keys[i]] = values[i]
}
fmt.Println(d)
// OUTPUT (keys are printed in sorted order since Go 1.12): map[level:info time:2017-05-30T19:02:08-05:00]
values is a list containing all submatches. The first submatch is the whole expression that matches the regexp, followed by a submatch for each capturing group.
You can wrap the code into a function if you need this more frequently (i.e. if you need something like Python's match.groupdict):
package main
import (
"fmt"
"regexp"
)
func groupmap(s string, r *regexp.Regexp) map[string]string {
values := r.FindStringSubmatch(s)
if values == nil {
return nil // no match, so there are no groups to map
}
keys := r.SubexpNames()
// create map
d := make(map[string]string)
for i := 1; i < len(keys); i++ {
d[keys[i]] = values[i]
}
return d
}
func main() {
s := `time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10`
re := regexp.MustCompile(`time="(?P<time>.*?)"\slevel=(?P<level>.*?)\s`)
fmt.Println(groupmap(s, re))
// OUTPUT (keys are printed in sorted order since Go 1.12): map[level:info time:2017-05-30T19:02:08-05:00]
}

Split string using regular expression in Go

I'm trying to find a good way to split a string using a regular expression instead of a string. Thanks
http://nsf.github.io/go/strings.html?f:Split!
You can use regexp.Split to split a string into a slice of strings with the regex pattern as the delimiter.
package main
import (
"fmt"
"regexp"
)
func main() {
re := regexp.MustCompile("[0-9]+")
txt := "Have9834a908123great10891819081day!"
split := re.Split(txt, -1)
set := []string{}
for i := range split {
set = append(set, split[i])
}
fmt.Println(set) // [Have a great day!]
}
I made a regex-split function based on the behavior of the regex split functions in Java, C#, PHP, and others. It returns only a slice of strings, without the index information.
func RegSplit(text string, delimeter string) []string {
reg := regexp.MustCompile(delimeter)
indexes := reg.FindAllStringIndex(text, -1)
laststart := 0
result := make([]string, len(indexes) + 1)
for i, element := range indexes {
result[i] = text[laststart:element[0]]
laststart = element[1]
}
result[len(indexes)] = text[laststart:]
return result
}
example:
fmt.Println(RegSplit("a1b22c333d", "[0-9]+"))
result:
[a b c d]
If you just want to split on certain characters, you can use strings.FieldsFunc, otherwise I'd go with regexp.FindAllString.
The regexp.Split() function would be the best way to do this.
You should be able to create your own split function that loops over the results of RegExp.FindAllString, placing the intervening substrings into a new array.
http://nsf.github.com/go/regexp.html?m:Regexp.FindAllString!
I found this old post while looking for an answer. I'm new to Go but these answers seem overly complex for the current version of Go. The simple function below returns the same result as those above.
package main
import (
"fmt"
"regexp"
)
func goReSplit(text string, pattern string) []string {
regex := regexp.MustCompile(pattern)
result := regex.Split(text, -1)
return result
}
func main() {
fmt.Printf("%#v\n", goReSplit("Have9834a908123great10891819081day!", "[0-9]+"))
}