apostrophe in word not being recognized for string replace - regex

I am having a problem replacing the word "you're" using regexp.
All of the other words are replaced correctly; only "you're" fails.
I think the match stops at the apostrophe.
I have to replace the word "you" with "I" and "you're" with "I'm".
It changes "you" to "I", but "you're" becomes "I're", because the match does not go past the apostrophe and treats it as the end of the word for some reason. I think I have to escape the apostrophe somehow.
Please see below for the code in question.
package main
import (
"fmt"
"math/rand"
"regexp"
"strings"
"time"
)
//Function ElizaResponse to take in and return a string
func ElizaResponse(str string) string {
// replace := "How do you know you are"
/*Regex MatchString function with isolation of the word "father"
*with a boundary, ignore-case regex command.
*/
if matched, _ := regexp.MatchString(`(?i)\bfather\b`, str);
//Condition to replace the original string if it has the word "father"
matched {
return "Why don’t you tell me more about your father?"
}
r1 := regexp.MustCompile(`(?i)\bI'?\s*a?m\b`)
//Match the words "I am" and capture for replacement
matched := r1.MatchString(str)
//condition if "I am" is matched
if matched {
capturedString := r1.ReplaceAllString(str, "$1")
boundaries := regexp.MustCompile(`\b`)
tokens := boundaries.Split(capturedString, -1)
// List the reflections.
reflections := [][]string{
{`I`, `you`},
{`you're`, `I'm`},
{`your`, `my`},
{`me`, `you`},
{`you`, `I`},
{`my`, `your`},
}
// Loop through each token, reflecting it if there's a match.
for i, token := range tokens {
for _, reflection := range reflections {
if matched, _ := regexp.MatchString(reflection[0], token); matched {
tokens[i] = reflection[1]
break
}
}
}
// Put the tokens back together.
return strings.Join(tokens, ``)
}
//Get random number from the length of the array of random struct
//an array of strings for the random response
response := []string{"I’m not sure what you’re trying to say. Could you explain it to me?",
"How does that make you feel?",
"Why do you say that?"}
//Return a random index of the array
return response[rand.Intn(len(response))]
}
func main() {
rand.Seed(time.Now().UTC().UnixNano())
fmt.Println("Im supposed to just take what you're saying at face value?")
fmt.Println(ElizaResponse("Im supposed to just take what you're saying at face value?"))
}

Note that the apostrophe character creates a word boundary, so your use of \b in regular expressions is probably tripping you up. That is, the string "I'm" has four word boundaries, one before and after each character.
┏━┳━┳━┓
┃I┃'┃m┃
┗━┻━┻━┛
│ │ │ └─ end of line creates a word boundary
│ │ └─── after punctuation character creates a word boundary
│ └───── before punctuation character creates a word boundary
└─────── start of line creates a word boundary
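The effect is easy to reproduce in isolation. Here is a minimal sketch (the helper names replaceYou and replaceYoure are made up for illustration and are not from the question) that applies a \byou\b pattern and a \byou're\b pattern to the same input:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// \byou\b also matches the "you" inside "you're", because \b matches
	// between the "u" and the apostrophe.
	reYou = regexp.MustCompile(`\byou\b`)
	// Spelling out the apostrophe and suffix matches the whole word instead.
	reYoure = regexp.MustCompile(`\byou're\b`)
)

// replaceYou is a hypothetical helper showing the unwanted behaviour.
func replaceYou(s string) string { return reYou.ReplaceAllString(s, "I") }

// replaceYoure is a hypothetical helper showing the full-word pattern.
func replaceYoure(s string) string { return reYoure.ReplaceAllString(s, "I'm") }

func main() {
	fmt.Println(replaceYou("you're"))   // I're
	fmt.Println(replaceYoure("you're")) // I'm
}
```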
There is no way to change the behavior of the word boundary metacharacter so you might be better off mapping regexes that include the full word with punctuation to the desired replacement, e.g.:
type Replacement struct {
rgx *regexp.Regexp
rpl string
}
replacements := []Replacement{
{regexp.MustCompile("\\bI\\b"), "you"},
{regexp.MustCompile("\\byou're\\b"), "I'm"},
// etc...
}
Note also that one of your examples contains a UTF-8 "right single quotation mark" (U+2019, 0xe28099), not to be confused with the UTF-8/ASCII apostrophe (U+0027, 0x27)!
fmt.Sprintf("% x", []byte("'’")) // => "27 e2 80 99"
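If the input may contain either character, one option is to normalize it before matching. This is only a sketch: normalizeApostrophes is a hypothetical helper, and it assumes you want every U+2019 treated as a plain apostrophe.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeApostrophes is a hypothetical helper: it rewrites the right single
// quotation mark (U+2019) to the ASCII apostrophe (U+0027) so that patterns
// written with ' match both spellings.
func normalizeApostrophes(s string) string {
	return strings.ReplaceAll(s, "\u2019", "'")
}

func main() {
	in := "you\u2019re"
	fmt.Println(normalizeApostrophes(in) == "you're") // true
}
```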

What you want to achieve here is to replace specific strings with specific replacements. It is easier to achieve that with a map of string keys and values, where each unique key is a literal phrase to search and the values are the texts to replace with.
This is how you may define the reflections:
reflections := map[string]string{
`you're`: `I'm`,
`your`: `my`,
`me`: `you`,
`you`: `I`,
`my`: `your`,
`I` : `you`,
}
Next, you need to sort the keys by length in descending order, so that longer alternatives such as you're are tried before you (here is sample code):
type ByLenDesc []string
func (a ByLenDesc) Len() int {
return len(a)
}
func (a ByLenDesc) Less(i, j int) bool {
return len(a[i]) > len(a[j])
}
func (a ByLenDesc) Swap(i, j int) {
a[i], a[j] = a[j], a[i]
}
And then in the function:
var keys []string
for key := range reflections {
keys = append(keys, key)
}
sort.Sort(ByLenDesc(keys))
Then build the pattern:
pat := "\\b(" + strings.Join(keys, `|`) + ")\\b"
// fmt.Println(pat) // => \b(you're|your|you|me|my|I)\b
The pattern matches you're, your, you, me, my, or I as whole words.
res := regexp.MustCompile(pat).ReplaceAllStringFunc(capturedString, func(m string) string {
return reflections[m]
})
The above code creates a regex object and replaces all matches with the corresponding reflections values.
See the Go demo.
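Put together, the pieces above might look like the following self-contained sketch. The names buildPattern and reflectPronouns are assumptions made for this example; it also uses sort.Slice with a tie-break instead of the ByLenDesc type, and passes the keys through regexp.QuoteMeta in case a key ever contains a regex metacharacter.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

var reflections = map[string]string{
	"you're": "I'm",
	"your":   "my",
	"me":     "you",
	"you":    "I",
	"my":     "your",
	"I":      "you",
}

// buildPattern (hypothetical helper) joins the keys, longest first,
// into a pattern of the form \b(k1|k2|...)\b.
func buildPattern(m map[string]string) *regexp.Regexp {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, regexp.QuoteMeta(k))
	}
	sort.Slice(keys, func(i, j int) bool {
		if len(keys[i]) != len(keys[j]) {
			return len(keys[i]) > len(keys[j])
		}
		return keys[i] < keys[j] // tie-break so the pattern is deterministic
	})
	return regexp.MustCompile(`\b(` + strings.Join(keys, "|") + `)\b`)
}

var reflectRe = buildPattern(reflections)

// reflectPronouns replaces every matched key with its mapped value.
// Each match is replaced exactly once, so "you" -> "I" is not turned
// back into "you" by a later rule.
func reflectPronouns(s string) string {
	return reflectRe.ReplaceAllStringFunc(s, func(m string) string {
		return reflections[m]
	})
}

func main() {
	fmt.Println(reflectPronouns("you're saying your words to me"))
	// I'm saying my words to you
}
```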

I have found that I just needed to change these two lines of code.
boundaries := regexp.MustCompile(`(\b[^\w']|$)`)
return strings.Join(tokens, ` `)
It stops the Split function from splitting at the ' character.
The tokens then need to be joined with a space; otherwise the result would be one continuous string.

Related

Czech characters in regexp search

I am trying to implement very simple text matcher for Czech words. Since Czech language is very suffix heavy I want to define start of the word and then just greedy match rest of the word. This is my implementation so far:
r := regexp.MustCompile("(?i)\\by\\w+\\b")
text := "x yž z"
matches := r.FindAllString(text, -1)
fmt.Println(matches) //have [], want [yž]
I studied Go's regexp syntax:
https://github.com/google/re2/wiki/Syntax
but I don't know how to define Czech language characters there. Using \w just matches ASCII characters, not Czech UTF-8 characters.
Can you please help me?
In RE2, both \w and \b are not Unicode-aware:
\b at ASCII word boundary («\w» on one side and «\W», «\A», or «\z» on the other)
\w word characters (== [0-9A-Za-z_])
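To see the difference, here is a small sketch comparing \w+ with the Unicode letter class \p{L}+ on the input from the question (asciiWords and letterWords are names invented for this example):

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	asciiWord  = regexp.MustCompile(`\w+`)     // ASCII-only word characters
	letterWord = regexp.MustCompile(`\p{L}+`) // any Unicode letter
)

// asciiWords returns the runs of ASCII word characters in s.
func asciiWords(s string) []string { return asciiWord.FindAllString(s, -1) }

// letterWords returns the runs of Unicode letters in s.
func letterWords(s string) []string { return letterWord.FindAllString(s, -1) }

func main() {
	fmt.Println(asciiWords("x yž z"))  // [x y z]
	fmt.Println(letterWords("x yž z")) // [x yž z]
}
```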
A more general approach is to split on any chunk of one or more non-letter chars, and then keep only those items that meet your criteria:
package main
import (
"fmt"
"strings"
"regexp"
)
func main() {
output := []string{}
r := regexp.MustCompile(`\P{L}+`)
str := "x--++yž,,,.z..00"
words := r.Split(str, -1)
for i := range words {
if len(words[i]) > 0 && (strings.HasPrefix(words[i], `y`) || strings.HasPrefix(words[i], `Y`)) {
output = append(output, words[i])
}
}
fmt.Println(output)
}
See the Go demo.
Note that a naive approach like
package main
import (
"fmt"
"regexp"
)
func main() {
output := []string{}
r := regexp.MustCompile(`(?i)(?:\P{L}|^)(y\p{L}*)(?:\P{L}|$)`)
str := "x--++yž,,,.z..00..."
matches := r.FindAllStringSubmatch(str, -1)
for _, v := range matches {
output = append(output, v[1])
}
fmt.Println(output)
}
won't work when the string contains consecutive matches such as match1,match2 match3: it will only fetch the odd occurrences, because the trailing non-capturing group consumes the char that the leading non-capturing group needs in order to start the next match.
A workaround for the above code would be adding some non-letter char to the end of each non-letter streak, say
package main
import (
"fmt"
"regexp"
)
func main() {
output := []string{}
r := regexp.MustCompile(`(?i)(?:\P{L}|^)(u\p{L}*)(?:\P{L}|$)`)
str := "uhličitá,uhličité,uhličitou,uhličitého,yz,my"
matches := r.FindAllStringSubmatch(regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `), -1)
for _, v := range matches {
output = append(output, v[1])
}
fmt.Println(output)
}
// => [uhličitá uhličité uhličitou uhličitého]
See this Go demo.
Here, regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `) adds a space after all chunks of non-letter chars.
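For comparison, the same split-then-filter idea can be written without regexp at all. This is a different technique from the regex answers above, shown only as a sketch: wordsWithPrefix is a hypothetical helper that splits on non-letters with strings.FieldsFunc and unicode.IsLetter, then compares prefixes case-insensitively via strings.ToLower.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// wordsWithPrefix (hypothetical helper) splits text on every run of
// non-letter runes and keeps the words that start with prefix,
// ignoring case.
func wordsWithPrefix(text, prefix string) []string {
	words := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	prefix = strings.ToLower(prefix)
	var out []string
	for _, w := range words {
		if strings.HasPrefix(strings.ToLower(w), prefix) {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	fmt.Println(wordsWithPrefix("uhličitá,uhličité,uhličitou,uhličitého,yz,my", "u"))
	// [uhličitá uhličité uhličitou uhličitého]
}
```

Because FieldsFunc never returns empty fields, consecutive separators need no special handling here.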

Include string into another string using regex

I have a set of strings. A string might have items listed between square brackets. I'd like to add a constant number of extra items to the strings that have brackets. The brackets might be empty or absent. For example:
string1 --> string1 # added nothing
string2[] --> string2[extra1="1",extra2="2"] # added two items
string3[item="1"] --> string3[item="1",extra1="1",extra2="2"] # added two items
Currently I achieve this with the following code (Golang):
str1 := "test"
str2 := `test[]`
str3 := `test[item1="1"]`
re := regexp.MustCompile(`\[(.+)?\]`)
for _, s := range []string{str1, str2, str3} {
s = re.ReplaceAllString(s, fmt.Sprintf(`[item1="a",item2="b",$1]`))
fmt.Println(s)
}
But in the output, in case of empty brackets I also got an unwanted comma "," in the end:
test
test[item1="a",item2="b",]
test[item1="a",item2="b",item1="1"]
Is it possible to avoid adding the comma when the brackets are empty?
Of course it's possible to parse string again and trim the comma, but it seems suboptimal.
Code example on Go playground
You can use two regexes, where one matches empty [] and the other matches [] with text inside. Below is the tested code -
https://play.golang.org/p/_DOOGDMUOCm
A second way is to inspect the string after the replacement: if the last two characters are ,] you can take the substring up to the , and append ]. I guess you already know this approach.
package main
import (
"fmt"
"regexp"
)
func main() {
str1 := "test"
str2 := `test[]`
str3 := `test[item1="1"]`
re := regexp.MustCompile(`\[(.*)\]`)
for _, s := range []string{str1, str2, str3} {
// Carry the old content over only when the brackets are non-empty.
if regexp.MustCompile(`\[(.+)\]`).MatchString(s) {
s = re.ReplaceAllString(s, `[item1="a",item2="b",$1]`)
} else {
s = re.ReplaceAllString(s, `[item1="a",item2="b"]`)
}
fmt.Println(s)
}
}
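A single-pass variant is also possible with ReplaceAllStringFunc, which lets the replacement depend on what was matched. The sketch below is one way to do it, not the answer's code: addExtras and bracketRe are names made up here, and the extras are appended after the existing items, as in the desired output listed in the question.

```go
package main

import (
	"fmt"
	"regexp"
)

// bracketRe matches a bracketed list, including the empty one.
var bracketRe = regexp.MustCompile(`\[[^\]]*\]`)

// addExtras (hypothetical helper) appends extras inside every [...] in s,
// inserting a separating comma only when the brackets already hold items.
func addExtras(s, extras string) string {
	return bracketRe.ReplaceAllStringFunc(s, func(m string) string {
		inner := m[1 : len(m)-1] // strip the surrounding "[" and "]"
		if inner == "" {
			return "[" + extras + "]"
		}
		return "[" + inner + "," + extras + "]"
	})
}

func main() {
	extras := `extra1="1",extra2="2"`
	for _, s := range []string{"string1", "string2[]", `string3[item="1"]`} {
		fmt.Println(addExtras(s, extras))
	}
	// string1
	// string2[extra1="1",extra2="2"]
	// string3[item="1",extra1="1",extra2="2"]
}
```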

replace all characters in string except last 4 characters

Using Go, how do I replace all characters in a string with "X" except the last 4 characters?
This works fine for php/javascript but not for golang as "?=" is not supported.
\w(?=\w{4,}$)
Tried this, but does not work. I couldn't find anything similar for golang
(\w)(?:\w{4,}$)
JavaScript working link
Go non-working link
A simple yet efficient solution that handles multi UTF-8-byte characters is to convert the string to []rune, overwrite runes with 'X' (except the last 4), then convert back to string.
func maskLeft(s string) string {
rs := []rune(s)
for i := 0; i < len(rs)-4; i++ {
rs[i] = 'X'
}
return string(rs)
}
Testing it:
fmt.Println(maskLeft("123"))
fmt.Println(maskLeft("123456"))
fmt.Println(maskLeft("1234世界"))
fmt.Println(maskLeft("世界3456"))
Output (try it on the Go Playground):
123
XX3456
XX34世界
XX3456
Also see related question: How to replace all characters in a string in golang
Let's say inputString is the string you want to mask all the characters of (except the last four).
First get the last four characters of the string:
last4 := string(inputString[len(inputString)-4:])
Then get a string of X's which is the same length as inputString, minus 4:
re := regexp.MustCompile(`\w`)
maskedPart := re.ReplaceAllString(inputString[:len(inputString)-4], "X")
Then combine maskedPart and last4 to get your result:
maskedString := strings.Join([]string{maskedPart,last4},"")
Simpler approach without regex and looping
package main
import (
"fmt"
"strings"
)
func main() {
s := "thisisarandomstring"
head := s[:len(s)-4]
tail := s[len(s)-4:]
mask := strings.Repeat("x", len(head))
fmt.Printf("%v%v", mask, tail)
}
// Output:
// xxxxxxxxxxxxxxxring
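Note that the slicing version above indexes bytes and assumes the input has at least four of them. A possible variant that counts runes and tolerates short input is sketched below; maskAllButLast and its keep parameter are assumptions for this example, not part of any answer.

```go
package main

import (
	"fmt"
	"strings"
)

// maskAllButLast (hypothetical helper) replaces every rune of s with "X"
// except the last keep runes. Strings with keep runes or fewer are
// returned unchanged.
func maskAllButLast(s string, keep int) string {
	if keep < 0 {
		keep = 0
	}
	rs := []rune(s)
	if len(rs) <= keep {
		return s
	}
	return strings.Repeat("X", len(rs)-keep) + string(rs[len(rs)-keep:])
}

func main() {
	fmt.Println(maskAllButLast("123", 4))      // 123
	fmt.Println(maskAllButLast("123456", 4))   // XX3456
	fmt.Println(maskAllButLast("1234世界", 4)) // XX34世界
}
```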
Create a Regexp with
re := regexp.MustCompile(`\w{4}$`)
Let's say inputString is the string you want to remove the last four characters from. Use this code to return a copy of inputString without the last 4 characters:
re.ReplaceAllString(inputString, "")
Note: if your input string might contain fewer than four characters, and you still want those characters removed since they are at the end of the string, you should instead use:
re := regexp.MustCompile(`\w{0,4}$`)

Golang Regex extract text between 2 delimiters - including delimiters

As stated in the title, I have a program in golang where I have a string with a recurring pattern. I have beginning and end delimiters for this pattern, and I would like to extract every occurrence, delimiters included, from the string. The following is pseudo code:
string := "... This is preceding text
PATTERN BEGINS HERE (
pattern can continue for any number of lines...
);
this is trailing text that is not part of the pattern"
In short, what I am attempting to do in the example above is extract all occurrences of the pattern that begins with "PATTERN BEGINS HERE" and ends with ");". I need help figuring out what the regex for this looks like.
Please let me know if any additional info or context is needed.
The regex is:
(?s)PATTERN BEGINS HERE.*?\);
where (?s) is a flag to let .* match multiple lines (see Go regex syntax).
See demo
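In Go this could be used roughly as follows. The sketch assumes the delimiters are exactly the literal texts from the question; extractBlocks and blockRe are names invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// (?s) lets . match newlines; .*? is lazy, so each match stops at the
// first ");" that follows its "PATTERN BEGINS HERE".
var blockRe = regexp.MustCompile(`(?s)PATTERN BEGINS HERE.*?\);`)

// extractBlocks (hypothetical helper) returns every occurrence of the
// pattern, delimiters included.
func extractBlocks(text string) []string {
	return blockRe.FindAllString(text, -1)
}

func main() {
	text := "preceding text\n" +
		"PATTERN BEGINS HERE (\n  first\n);\n" +
		"between\n" +
		"PATTERN BEGINS HERE (\n  second\n);\n" +
		"trailing text"
	for _, b := range extractBlocks(text) {
		fmt.Printf("%q\n", b)
	}
}
```

Because the match is lazy, a ");" that appears inside a block would end that match early.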
Not regex, but works
func findInString(str, start, end string) ([]byte, error) {
var match []byte
index := strings.Index(str, start)
if index == -1 {
return match, errors.New("Not found")
}
index += len(start)
// Copy bytes until the end delimiter begins at the current position.
for index < len(str) {
if strings.HasPrefix(str[index:], end) {
return match, nil
}
match = append(match, str[index])
index++
}
// Reached the end of str without seeing the end delimiter.
return nil, errors.New("Not found")
}
EDIT: Best to handle individual character as bytes and return a byte array

Golang regexp to match multiple patterns between keyword pairs

I have a string which has two keywords: "CURRENT NAME(S)" and "NEW NAME(S)", and each of these keywords is followed by a bunch of words. I want to extract the set of words that follows each of these keywords. To illustrate with code:
s := `"CURRENT NAME(S)
Name1, Name2",,"NEW NAME(S)
NewName1,NewName2"`
re := regexp.MustCompile(`"CURRENT NAME(S).*",,"NEW NAME(S).*"`)
segs := re.FindAllString(s, -1)
fmt.Println("segs:", segs)
segs2 := re.FindAllStringSubmatch(s, -1)
fmt.Println("segs2:", segs2)
As you can see, the string 's' holds the input. "Name1, Name2" is the current names list and "NewName1,NewName2" is the new names list. I want to extract these two lists. The two lists are separated by a comma. Each keyword begins with a double quote, and its list ends at the corresponding closing double quote.
What is the way to use regexp such that the program can print "Name1, Name2" and "NewName1,NewName2" ?
The issue with your regex is that the input string contains newline symbols, and . in Go regex does not match a newline. Another issue is that the .* is a greedy pattern and will match as many symbols as it can up to the last second keyword. Also, you need to escape parentheses in the regex pattern to match the ( and ) literal symbols.
The best way to solve the issue is to change .* into a negated character class pattern [^"]* and place it inside a pair of non-escaped ( and ) to form a capturing group (a construct to get submatches from the match).
Here is a Go demo:
package main
import (
"fmt"
"regexp"
)
func main() {
s := `"CURRENT NAME(S)
Name1, Name2",,"NEW NAME(S)
NewName1,NewName2"`
re := regexp.MustCompile(`"CURRENT NAME\(S\)\s*([^"]*)",,"NEW NAME\(S\)\s*([^"]*)"`)
segs2 := re.FindAllStringSubmatch(s,-1)
fmt.Printf("segs2: [%s; %s]", segs2[0][1], segs2[0][2])
}
Now, the regex matches:
"CURRENT NAME\(S\) - a literal string "CURRENT NAME(S)
\s* - zero or more whitespaces
([^"]*) - Group 1 capturing 0+ chars other than "
",,"NEW NAME\(S\) - a literal string ",,"NEW NAME(S)
\s* - zero or more whitespaces
([^"]*) - Group 2 capturing 0+ chars other than "
" - a literal "
If your input doesn't change then the simplest way would be to use submatches (groups). You can try something like this:
// (?s) is a flag that enables '.' to match newlines
var r = regexp.MustCompile(`(?s)CURRENT NAME\(S\)(.*)",,"NEW NAME\(S\)(.*)"`)
fmt.Println(r.MatchString(s))
m := r.FindSubmatch([]byte(s)) // FindSubmatch requires []byte
for i, match := range m {
s := string(match)
fmt.Printf("Match - %d: %s\n", i, strings.Trim(s, "\n")) //remove the newline
}
Output (note that the first entry is the full text matched by the regex; see https://golang.org/pkg/regexp/#Regexp.FindSubmatch):
Match - 0: CURRENT NAME(S)
Name1, Name2",,"NEW NAME(S)
NewName1,NewName2"
Match - 1: Name1, Name2
Match - 2: NewName1,NewName2
Example: https://play.golang.org/p/0cgBOMumtp
For a fixed format like in the example, you can also avoid regular expressions and perform explicit parsing as in this example - https://play.golang.org/p/QDIyYiWJHt:
package main
import (
"fmt"
"strings"
)
func main() {
s := `"CURRENT NAME(S)
Name1, Name2",,"NEW NAME(S)
NewName1,NewName2"`
names := []string{}
parts := strings.Split(s, ",,")
for _, part := range parts {
part = strings.Trim(part, `"`)
part = strings.TrimPrefix(part, "CURRENT NAME(S)")
part = strings.TrimPrefix(part, "NEW NAME(S)")
part = strings.TrimSpace(part)
names = append(names, part)
}
fmt.Println("Names:")
for _, name := range names {
fmt.Println(name)
}
}
Output:
Names:
Name1, Name2
NewName1,NewName2
It uses a few more lines of code but may make it easier to understand the processing logic at a first glance.