How to extract different variables from a regex without compiling each expression - regex

I have a struct representing sizes of computer objects. Objects of this struct are constructed from string values input by users; e.g. "50KB" would be tokenised into an int value of "50" and the string value "KB".
type SizeUnit string
const (
B = "B"
KB = "KB"
MB = "MB"
GB = "GB"
TB = "TB"
)
type ObjectSize struct {
NumberOfUnits int
Unit SizeUnit
}
func NewObjectSizeFromString(input_str string) (*ObjectSize, error)
In the body of this function, I first check if the input value is in the valid format; i.e. any number of digits, followed by any one of "B", "KB", "MB", "GB" or "TB". I then extract the int and string components separately and return a pointer to a struct.
In order to do these three things though, I'm having to compile the regex three times.
The first time to check the format of the input string
rg, err := regexp.Compile(`^[0-9]+B$|KB$|MB$|GB$|TB$`)
And then compile again to fetch the int component:
rg, err := regexp.Compile(`^[0-9]+`)
rg.FindString(input_str)
And then compile again to fetch the string/units component:
rg, err := regexp.Compile(`B$|KB$|MB$|GB$|TB$`)
rg.FindString(input_str)
Is there any way to get the two components from the input string with a single regex compilation?
The full code can be found on the Go Playground.
I should point out that this is an academic question as I'm experimenting with Go's regex library. For a simple use-case of this sort, I would probably use a simple for loop to parse the input string.

You can capture both the values with a single expression using regexp.FindStringSubmatch:
func NewObjectSizeFromString(input_str string) (*ObjectSize, error) {
var defaultReturn *ObjectSize = nil
full_search_pattern := `^([0-9]+)([KMGT]?B)$`
rg, err := regexp.Compile(full_search_pattern)
if err != nil {
return defaultReturn, errors.New("Could not compile search expression")
}
matched := rg.FindStringSubmatch(input_str)
if matched == nil {
return defaultReturn, errors.New("Not in valid format")
}
i, err := strconv.ParseInt(matched[1], 10, 32)
if err != nil {
return defaultReturn, err
}
return &ObjectSize{int(i), SizeUnit(matched[2])}, nil
}
See the playground.
The ^([0-9]+)([KMGT]?B)$ regex matches
^ - start of string
([0-9]+) - Group 1 (this value will be held in matched[1]): one or more digits
([KMGT]?B) - Group 2 (it will be in matched[2]): an optional K, M, G, T letter, and then a B letter
$ - end of string.
Note that matched[0] will hold the whole match.
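Since the pattern never changes, it can also be compiled once at package level with regexp.MustCompile instead of on every call. The following is a minimal, self-contained sketch of that variant; the names sizePattern and parseObjectSize are hypothetical and not part of the original code.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

type SizeUnit string

type ObjectSize struct {
	NumberOfUnits int
	Unit          SizeUnit
}

// sizePattern (hypothetical name) is compiled once, when the package is initialised.
var sizePattern = regexp.MustCompile(`^([0-9]+)([KMGT]?B)$`)

// parseObjectSize (hypothetical name) validates the input and extracts
// both components from a single FindStringSubmatch call.
func parseObjectSize(input string) (*ObjectSize, error) {
	m := sizePattern.FindStringSubmatch(input)
	if m == nil {
		return nil, errors.New("not in valid format")
	}
	// m[0] is the whole match, m[1] the digits, m[2] the unit.
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return nil, err // e.g. the number does not fit in an int
	}
	return &ObjectSize{NumberOfUnits: n, Unit: SizeUnit(m[2])}, nil
}

func main() {
	for _, in := range []string{"50KB", "7B", "50XB"} {
		size, err := parseObjectSize(in)
		if err != nil {
			fmt.Println(in, "->", err)
			continue
		}
		fmt.Println(in, "->", size.NumberOfUnits, size.Unit)
	}
}
```

MustCompile panics on an invalid pattern, which is usually acceptable for a fixed, hard-coded expression because the mistake shows up at start-up.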

Related

Selecting text between borders using regexp in Go [duplicate]

content := `{null,"Age":24,"Balance":33.23}`
rule,_ := regexp.Compile(`"([^\"]+)"`)
results := rule.FindAllString(content,-1)
fmt.Println(results[0]) //"Age"
fmt.Println(results[1]) //"Balance"
This is a JSON string containing a bare ``null`` value, as shown above.
It comes from a web API and I don't want to replace anything inside it.
I want to use a regex to match all the keys in this JSON without the double quotes, so that the output is ``Age`` and ``Balance`` rather than ``"Age"`` and ``"Balance"``.
How can I achieve this?
One solution would be to use a regular expression that matches any character between quotes (such as your example or ".*?") and either put a matching group (aka "submatch") inside the quotes or return the relevant substring of the match, using regexp.FindAllStringSubmatch(...) or regexp.FindAllString(...), respectively.
For example (Go Playground):
func main() {
str := `{null,"Age":24,"Balance":33.23}`
fmt.Printf("OK1: %#v\n", getQuotedStrings1(str))
// OK1: []string{"Age", "Balance"}
fmt.Printf("OK2: %#v\n", getQuotedStrings2(str))
// OK2: []string{"Age", "Balance"}
}
var re1 = regexp.MustCompile(`"(.*?)"`) // Note the matching group (submatch).
func getQuotedStrings1(s string) []string {
ms := re1.FindAllStringSubmatch(s, -1)
ss := make([]string, len(ms))
for i, m := range ms {
ss[i] = m[1]
}
return ss
}
var re2 = regexp.MustCompile(`".*?"`)
func getQuotedStrings2(s string) []string {
ms := re2.FindAllString(s, -1)
ss := make([]string, len(ms))
for i, m := range ms {
ss[i] = m[1 : len(m)-1] // Note the substring of the match.
}
return ss
}
Note that the second version (without a submatching group) may be slightly faster based on a simple benchmark, if performance is critical.
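Both versions above return every quoted string, which in the sample input happens to be exactly the keys. If the input could also contain quoted string values, one possible refinement is to require a colon after the closing quote. This is only a sketch under that assumption: the names keyPattern and quotedKeys are hypothetical, and a string value that itself contains `":` would still confuse it.

```go
package main

import (
	"fmt"
	"regexp"
)

// keyPattern (hypothetical name) matches a double-quoted string that is
// followed by optional whitespace and a colon, capturing the text inside the quotes.
var keyPattern = regexp.MustCompile(`"([^"]+)"\s*:`)

// quotedKeys (hypothetical name) returns the captured key names without quotes.
func quotedKeys(s string) []string {
	ms := keyPattern.FindAllStringSubmatch(s, -1)
	keys := make([]string, 0, len(ms))
	for _, m := range ms {
		keys = append(keys, m[1]) // m[1] is the first capture group
	}
	return keys
}

func main() {
	fmt.Printf("%q\n", quotedKeys(`{null,"Age":24,"Balance":33.23}`))
	fmt.Printf("%q\n", quotedKeys(`{null,"Age":24,"Name":"Bob"}`))
}
```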

replace all characters in string except last 4 characters

Using Go, how do I replace all characters in a string with "X" except the last 4 characters?
This works fine in PHP/JavaScript but not in Go, because the lookahead "?=" is not supported:
\w(?=\w{4,}$)
I tried the following, but it does not work, and I couldn't find anything similar for Go:
(\w)(?:\w{4,}$)
A simple yet efficient solution that handles multi UTF-8-byte characters is to convert the string to []rune, overwrite runes with 'X' (except the last 4), then convert back to string.
func maskLeft(s string) string {
rs := []rune(s)
for i := 0; i < len(rs)-4; i++ {
rs[i] = 'X'
}
return string(rs)
}
Testing it:
fmt.Println(maskLeft("123"))
fmt.Println(maskLeft("123456"))
fmt.Println(maskLeft("1234世界"))
fmt.Println(maskLeft("世界3456"))
Output (try it on the Go Playground):
123
XX3456
XX34世界
XX3456
Also see related question: How to replace all characters in a string in golang
Let's say inputString is the string you want to mask all the characters of (except the last four).
First get the last four characters of the string:
last4 := inputString[len(inputString)-4:]
Then replace every word character in the rest of the string (everything except the last four bytes) with an X:
re := regexp.MustCompile(`\w`)
maskedPart := re.ReplaceAllString(inputString[:len(inputString)-4], "X")
Then combine maskedPart and last4 to get your result:
maskedString := strings.Join([]string{maskedPart,last4},"")
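The three steps above can be combined into one function. The sketch below is one way to do that; maskAllButLast4 and wordChar are hypothetical names. It assumes ASCII input, because it slices by byte index, and it adds a length guard so that short inputs are returned unchanged instead of causing a slice-bounds panic.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordChar (hypothetical name) matches a single word character.
// A raw string literal is used so that \w needs no extra escaping.
var wordChar = regexp.MustCompile(`\w`)

// maskAllButLast4 (hypothetical name) replaces every word character with
// "X" except in the last four bytes. It assumes ASCII input.
func maskAllButLast4(s string) string {
	if len(s) <= 4 {
		return s // nothing to mask
	}
	cut := len(s) - 4
	return wordChar.ReplaceAllString(s[:cut], "X") + s[cut:]
}

func main() {
	fmt.Println(maskAllButLast4("123456"))    // XX3456
	fmt.Println(maskAllButLast4("1234-5678")) // XXXX-5678
	fmt.Println(maskAllButLast4("123"))       // 123
}
```

Note that only word characters are replaced, so punctuation such as the hyphen in the second example is left as it is.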
Simpler approach without regex and looping
package main
import (
"fmt"
"strings"
)
func main() {
s := "thisisarandomstring" // avoid naming a variable "string", which shadows the built-in type
head := s[:len(s)-4]
tail := s[len(s)-4:]
mask := strings.Repeat("x", len(head))
fmt.Printf("%v%v", mask, tail)
}
// Output:
// xxxxxxxxxxxxxxxring
Create a Regexp with
re := regexp.MustCompile(`\w{4}$`)
Let's say inputString is the string you want to remove the last four characters from. Use this code to return a copy of inputString without the last 4 characters:
re.ReplaceAllString(inputString, "")
Note: if it's possible that your input string could start out with fewer than four characters, and you still want those characters removed since they are at the end of the string, you should instead use:
re := regexp.MustCompile(`\w{0,4}$`)

apostrophe in word not being recognized for string replace

I am having a problem replacing the word "you're" with regexp.
All of the other words are changing correctly just the word "you're".
I think it is not parsing after the apostrophe.
I have to replace the word "you" to "I" and "you're" to "I'm".
It will change "you" to "I" but "you're" becomes "I're" because it is not going past the apostrophe and it thinks that is the end of the word for some reason. I have to escape the apostrophe somehow.
Please see below for the code in question.
package main
import (
"fmt"
"math/rand"
"regexp"
"strings"
"time"
)
//Function ElizaResponse to take in and return a string
func ElizaResponse(str string) string {
// replace := "How do you know you are"
/*Regex MatchString function with isolation of the word "father"
*with a word-boundary, ignore-case regex.
*/
if matched, _ := regexp.MatchString(`(?i)\bfather\b`, str);
//Condition to replace the original string if it has the word "father"
matched {
return "Why don’t you tell me more about your father?"
}
r1 := regexp.MustCompile(`(?i)\bI'?\s*a?m\b`)
//Match the words "I am" and capture for replacement
matched := r1.MatchString(str)
//condition if "I am" is matched
if matched {
capturedString := r1.ReplaceAllString(str, "$1")
boundaries := regexp.MustCompile(`\b`)
tokens := boundaries.Split(capturedString, -1)
// List the reflections.
reflections := [][]string{
{`I`, `you`},
{`you're`, `I'm`},
{`your`, `my`},
{`me`, `you`},
{`you`, `I`},
{`my`, `your`},
}
// Loop through each token, reflecting it if there's a match.
for i, token := range tokens {
for _, reflection := range reflections {
if matched, _ := regexp.MatchString(reflection[0], token); matched {
tokens[i] = reflection[1]
break
}
}
}
// Put the tokens back together.
return strings.Join(tokens, ``)
}
//Get random number from the length of the array of random struct
//an array of strings for the random response
response := []string{"I’m not sure what you’re trying to say. Could you explain it to me?",
"How does that make you feel?",
"Why do you say that?"}
//Return a random index of the array
return response[rand.Intn(len(response))]
}
func main() {
rand.Seed(time.Now().UTC().UnixNano())
fmt.Println("Im supposed to just take what you're saying at face value?")
fmt.Println(ElizaResponse("Im supposed to just take what you're saying at face value?"))
}
Note that the apostrophe character creates a word boundary, so your use of \b in regular expressions is probably tripping you up. That is, the string "I'm" has four word boundaries, one before and after each character.
┏━┳━┳━┓
┃I┃'┃m┃
┗━┻━┻━┛
│ │ │ └─ end of line creates a word boundary
│ │ └─── after punctuation character creates a word boundary
│ └───── before punctuation character creates a word boundary
└─────── start of line creates a word boundary
There is no way to change the behavior of the word boundary metacharacter so you might be better off mapping regexes that include the full word with punctuation to the desired replacement, e.g.:
type Replacement struct {
rgx *regexp.Regexp
rpl string
}
replacements := []Replacement{
{regexp.MustCompile("\\bI\\b"), "you"},
{regexp.MustCompile("\\byou're\\b"), "I'm"},
// etc...
}
Note also that one of your examples contains a UTF-8 "right single quotation mark" (U+2019, 0xe28099), not to be confused with the UTF-8/ASCII apostrophe (U+0027, 0x27)!
fmt.Sprintf("% x", []byte("'’")) // => "27 e2 80 99"
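To see the word-boundary behaviour described in this answer in isolation, here is a small sketch. The variable names boundary and youWord are hypothetical. It shows that splitting on \b separates the apostrophe from the letters around it, and that a whole-word pattern for "you" therefore also matches inside "you're".

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// boundary (hypothetical name) matches the empty string at a word boundary.
	boundary = regexp.MustCompile(`\b`)
	// youWord (hypothetical name) matches "you" as a whole word.
	youWord = regexp.MustCompile(`\byou\b`)
)

func main() {
	// The apostrophe is not a word character, so there is a boundary
	// on each side of it.
	fmt.Printf("%q\n", boundary.Split("you're", -1)) // ["you" "'" "re"]

	// Because of the boundary after "you", the whole-word pattern matches
	// the first three letters of "you're".
	fmt.Println(youWord.ReplaceAllString("you're", "I")) // I're
}
```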
What you want to achieve here is to replace specific strings with specific replacements. It is easier to achieve that with a map of string keys and values, where each unique key is a literal phrase to search and the values are the texts to replace with.
This is how you may define the reflections:
reflections := map[string]string{
`you're`: `I'm`,
`your`: `my`,
`me`: `you`,
`you`: `I`,
`my`: `your`,
`I` : `you`,
}
Next, you need to sort the keys by length in descending order, so that longer phrases such as you're are tried before shorter ones such as you (here is some sample code):
type ByLenDesc []string
func (a ByLenDesc) Len() int {
return len(a)
}
func (a ByLenDesc) Less(i, j int) bool {
return len(a[i]) > len(a[j])
}
func (a ByLenDesc) Swap(i, j int) {
a[i], a[j] = a[j], a[i]
}
And then in the function:
var keys []string
for key := range reflections {
keys = append(keys, key)
}
sort.Sort(ByLenDesc(keys))
Then build the pattern:
pat := "\\b(" + strings.Join(keys, `|`) + ")\\b"
// fmt.Println(pat) // => \b(you're|your|you|me|my|I)\b
The pattern matches you're, your, you, me, my, or I as whole words.
res := regexp.MustCompile(pat).ReplaceAllStringFunc(capturedString, func(m string) string {
return reflections[m]
})
The above code creates a regex object and replaces all matches with the corresponding reflections values.
See the Go demo.
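Put together, the approach from this answer might look like the following self-contained sketch. The function names buildReflectionPattern and reflectWords are hypothetical, sort.Slice is used in place of the ByLenDesc type purely for brevity, and regexp.QuoteMeta is added as a precaution in case a key ever contains regex metacharacters.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

var reflections = map[string]string{
	"you're": "I'm",
	"your":   "my",
	"me":     "you",
	"you":    "I",
	"my":     "your",
	"I":      "you",
}

// buildReflectionPattern (hypothetical name) joins the map keys into one
// alternation, longest key first, wrapped in word boundaries.
func buildReflectionPattern(m map[string]string) *regexp.Regexp {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, regexp.QuoteMeta(k))
	}
	sort.Slice(keys, func(i, j int) bool {
		if len(keys[i]) != len(keys[j]) {
			return len(keys[i]) > len(keys[j]) // longer keys first
		}
		return keys[i] < keys[j] // deterministic order for equal lengths
	})
	return regexp.MustCompile(`\b(` + strings.Join(keys, "|") + `)\b`)
}

var reflectionPattern = buildReflectionPattern(reflections)

// reflectWords (hypothetical name) replaces each matched word with its reflection.
func reflectWords(s string) string {
	return reflectionPattern.ReplaceAllStringFunc(s, func(m string) string {
		return reflections[m]
	})
}

func main() {
	fmt.Println(reflectWords("you're saying that to me")) // I'm saying that to you
	fmt.Println(reflectWords("I like your hat"))          // you like my hat
}
```

Because all replacements happen in a single pass, a word that has already been replaced is not matched again, so "me" becoming "you" is not turned back into "I".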
I have found that I just needed to change these two lines of code:
boundaries := regexp.MustCompile(`(\b[^\w']|$)`)
return strings.Join(tokens, ` `)
This stops the Split function from splitting at the ' character.
The tokens then need to be joined with a space; otherwise the result would be one continuous string.

Regex extracting sets of numbers from string when prefix occurs, while not matching said prefix

As stated in the title, given a situation where I have a string like so:
"somestring~200~122"
I want a regex that matches the numbers only when they are preceded by the prefix "~", so that I ultimately end up with [200, 122].
Matching the prefix is necessary because I need to guard against a case like the one below, which should not be matched:
"somestring~abc200~def122"
For additional context: I am using Go, so I am planning on doing something like the following in order to obtain the numbers within the string:
pattern := regexp.MustCompile("regex i need help with")
numbers := pattern.FindAllString(host, -1)
You can use FindAllStringSubmatch to extract the group containing just the digits. Below is an example that finds all instances of ~ followed by numbers. It additionally converts all the matches to ints and inserts them into a slice:
package main
import (
"fmt"
"regexp"
"strconv"
)
func main() {
host := "somestring~200~122"
pattern := regexp.MustCompile(`~(\d+)`)
numberStrings := pattern.FindAllStringSubmatch(host, -1)
numbers := make([]int, len(numberStrings))
for i, numberString := range numberStrings {
number, err := strconv.Atoi(numberString[1])
if err != nil {
panic(err)
}
numbers[i] = number
}
fmt.Println(numbers)
}
https://play.golang.org/p/09YyewtRXz

Golang Regex: FindAllStringSubmatch to []string

I download a multiline file from Amazon S3 in format like:
ColumnAv1 ColumnBv1 ColumnCv1 ...
ColumnAv2 ColumnBv2 ColumnCv2 ...
The file content is of type []byte. Then I want to parse this with a regex:
matches := re.FindAllSubmatch(file,-1)
Then I want to feed the result row by row to a function which takes a []string as input (string[0] is ColumnAv1, string[1] is ColumnBv1, ...).
How should I convert the [][][]byte result into a []string for the first row, the second row, and so on? I suppose I should do it in a loop, but I cannot get this working:
for i := 0; i < len(matches); i++ {
tmp:=myfunction(???)
}
BTW, why does the function FindAllSubmatch return [][][]byte whereas FindAllStringSubmatch returns [][]string?
(Sorry I don't have right now access to my real example, so the syntax may not be proper)
It's all explained extensively in the package's documentation.
Read the paragraph which explains:
There are 16 methods of Regexp that match a regular expression and identify the matched text. Their names are matched by this regular expression:
Find(All)?(String)?(Submatch)?(Index)?
In your case, you probably want to use FindAllStringSubmatch.
In Go, a string is in effect a read-only slice of bytes.
You can choose to either keep passing []byte variables around,
or convert the []byte values to string:
var byteSlice = []byte{'F','o','o'}
var str string
str = string(byteSlice)
You can simply iterate through the bytes result as you would for a strings result, using two nested loops, and convert each slice of bytes to a string in the inner loop:
package main
import "fmt"
func main() {
f := [][][]byte{{{'a', 'b', 'c'}}}
for _, line := range f {
for _, match := range line { // match is of type []byte
fmt.Println(string(match))
}
}
}
Playground
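To feed each row to a function that takes a []string, the nested loops can collect the converted columns instead of printing them. The sketch below assumes three whitespace-separated columns per line; rowPattern and rowsFromMatches are hypothetical names, and the real pattern from the question is not known.

```go
package main

import (
	"fmt"
	"regexp"
)

// rowPattern (hypothetical) captures three whitespace-separated columns per line.
var rowPattern = regexp.MustCompile(`(?m)^(\S+)[ \t]+(\S+)[ \t]+(\S+)`)

// rowsFromMatches (hypothetical name) converts the [][][]byte result of
// FindAllSubmatch into one []string per row. Index 0 of each match is the
// whole match, so only the capture groups from index 1 onwards are kept.
func rowsFromMatches(matches [][][]byte) [][]string {
	rows := make([][]string, 0, len(matches))
	for _, m := range matches {
		if len(m) == 0 {
			continue // defensive; FindAllSubmatch always includes the whole match
		}
		row := make([]string, 0, len(m)-1)
		for _, col := range m[1:] {
			row = append(row, string(col)) // []byte to string conversion
		}
		rows = append(rows, row)
	}
	return rows
}

func main() {
	file := []byte("ColumnAv1 ColumnBv1 ColumnCv1\nColumnAv2 ColumnBv2 ColumnCv2\n")
	for _, row := range rowsFromMatches(rowPattern.FindAllSubmatch(file, -1)) {
		fmt.Println(row) // each row is a []string ready to pass to another function
	}
}
```

If the data is already available as a string, FindAllStringSubmatch returns [][]string directly and the conversion step is unnecessary.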