Remove all articles and other strings from a string using Go?

Is there a function in Go, or a regular expression, that removes only the articles from a string?
I tried the code below, but it also removes those letters when they appear inside other words:
removalString := "This is a string"
stringToRemove := []string{"a", "an", "the", "is"}
for _, wordToRemove := range stringToRemove {
removalString = strings.Replace(removalString, wordToRemove, "", -1)
}
space := regexp.MustCompile(`\s+`)
trimedExtraSpaces := space.ReplaceAllString(removalString, " ")
spacesCovertedtoDashes := strings.Replace(trimedExtraSpaces, " ", "-", -1)
slug := strings.ToLower(spacesCovertedtoDashes)
fmt.Println(slug)
This also removes the "is" inside "This".
The expected output is this-string.

You can use strings.Split and strings.Join plus a loop for filtering and then building it together again:
removalString := "This is a string"
stringToRemove := []string{"a", "an", "the", "is"}
filteredStrings := make([]string, 0)
for _, w := range strings.Split(removalString, " ") {
shouldAppend := true
lowered := strings.ToLower(w)
for _, w2 := range stringToRemove {
if lowered == w2 {
shouldAppend = false
break
}
}
if shouldAppend {
filteredStrings = append(filteredStrings, lowered)
}
}
resultString := strings.Join(filteredStrings, "-")
fmt.Println(resultString)
Output:
this-string

My version uses only regexp.
Build a regexp of the form '\ba\b|\ban\b|\bthe\b|\bis\b|', which matches the listed words only when they have word boundaries on both sides, so the "is" in "This" is not matched. The trailing | adds an empty alternative, which is harmless here because the replacement is an empty string.
A second regexp turns each run of one or more spaces into a single dash.
package main
import (
"bytes"
"fmt"
"regexp"
)
func main() {
removalString := "This is a strange string"
stringToRemove := []string{"a", "an", "the", "is"}
var reg bytes.Buffer
for _, x := range stringToRemove {
reg.WriteString(`\b`) // word boundary
reg.WriteString(x)
reg.WriteString(`\b`)
reg.WriteString(`|`) // alternation operator
}
regx := regexp.MustCompile(reg.String())
slug := regx.ReplaceAllString(removalString, "")
regx2 := regexp.MustCompile(` +`)
slug = regx2.ReplaceAllString(slug, "-")
fmt.Println(slug)
}
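A variation on the regexp answer, as a minimal sketch rather than a definitive implementation: it joins the words with strings.Join so there is no trailing |, escapes each word with regexp.QuoteMeta, adds (?i) so that "The" is removed as well, and lowercases the result to match the expected output in the question. The function name slugify is made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// slugify is a hypothetical helper: it removes whole-word occurrences of
// stopWords from s (case-insensitively) and joins the remaining words with dashes.
func slugify(s string, stopWords []string) string {
	quoted := make([]string, len(stopWords))
	for i, w := range stopWords {
		quoted[i] = regexp.QuoteMeta(w) // escape any regexp metacharacters
	}
	// \b on both sides keeps "is" from matching inside "This".
	re := regexp.MustCompile(`(?i)\b(?:` + strings.Join(quoted, "|") + `)\b`)
	cleaned := re.ReplaceAllString(s, "")
	// strings.Fields splits on any run of whitespace and drops empty items.
	return strings.ToLower(strings.Join(strings.Fields(cleaned), "-"))
}

func main() {
	fmt.Println(slugify("This is a string", []string{"a", "an", "the", "is"}))
	// prints: this-string
}
```

Note that \b is an ASCII word boundary in Go, so this sketch assumes the words to remove are plain ASCII.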

Related

Czech characters in regexp search

I am trying to implement a very simple text matcher for Czech words. Since Czech is very suffix-heavy, I want to define the start of the word and then greedily match the rest of it. This is my implementation so far:
r := regexp.MustCompile("(?i)\\by\\w+\\b")
text := "x yž z"
matches := r.FindAllString(text, -1)
fmt.Println(matches) //have [], want [yž]
I studied Go's regexp syntax:
https://github.com/google/re2/wiki/Syntax
but I don't know how to include Czech characters there. \w matches only ASCII characters, not Czech letters such as ž.
Can you please help me?

In RE2, neither \w nor \b is Unicode-aware:
\b at ASCII word boundary («\w» on one side and «\W», «\A», or «\z» on the other)
\w word characters (== [0-9A-Za-z_])
A more general approach is to split on any run of one or more non-letter characters (\P{L}+) and then keep only the items that meet your criteria:
package main
import (
"fmt"
"strings"
"regexp"
)
func main() {
output := []string{}
r := regexp.MustCompile(`\P{L}+`)
str := "x--++yž,,,.z..00"
words := r.Split(str, -1)
for i := range words {
if len(words[i]) > 0 && (strings.HasPrefix(words[i], `y`) || strings.HasPrefix(words[i], `Y`)) {
output = append(output, words[i])
}
}
fmt.Println(output)
}
Note that a naive approach like
package main
import (
"fmt"
"regexp"
)
func main() {
output := []string{}
r := regexp.MustCompile(`(?i)(?:\P{L}|^)(y\p{L}*)(?:\P{L}|$)`)
str := "x--++yž,,,.z..00..."
matches := r.FindAllStringSubmatch(str, -1)
for _, v := range matches {
output = append(output, v[1])
}
fmt.Println(output)
}
won't work when matches are consecutive, as in match1,match2 match3: it fetches only every other occurrence, because the trailing non-capturing group consumes the character that the leading non-capturing group needs for the next match.
A workaround for the above code is to add an extra non-letter character to the end of each run of non-letter characters, say
package main
import (
"fmt"
"regexp"
)
func main() {
output := []string{}
r := regexp.MustCompile(`(?i)(?:\P{L}|^)(u\p{L}*)(?:\P{L}|$)`)
str := "uhličitá,uhličité,uhličitou,uhličitého,yz,my"
matches := r.FindAllStringSubmatch(regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `), -1)
for _, v := range matches {
output = append(output, v[1])
}
fmt.Println(output)
}
// => [uhličitá uhličité uhličitou uhličitého]
Here, regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `) adds a space after all chunks of non-letter chars.
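The split-and-filter idea can also be written without any regexp. The following is a minimal sketch, assuming that "word" means a run of Unicode letters; wordsWithPrefix is a name invented for this example. It splits with strings.FieldsFunc and unicode.IsLetter and compares the prefix case-insensitively.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// wordsWithPrefix is a hypothetical helper: it returns the words in text
// (runs of Unicode letters) that start with prefix, ignoring case.
func wordsWithPrefix(text, prefix string) []string {
	// Every non-letter rune is a separator; FieldsFunc never returns empty items.
	words := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	lowPrefix := strings.ToLower(prefix)
	var out []string
	for _, w := range words {
		if strings.HasPrefix(strings.ToLower(w), lowPrefix) {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	fmt.Println(wordsWithPrefix("uhličitá,uhličité,uhličitou,uhličitého,yz,my", "u"))
	// prints: [uhličitá uhličité uhličitou uhličitého]
}
```

Like \P{L}, unicode.IsLetter does not treat combining marks as letters, so this assumes the text uses precomposed characters such as č.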

Include string into another string using regex

I have a set of strings. A string may have items listed between square brackets. I'd like to add a constant number of extra items to the strings that have brackets. The brackets may be empty or absent. For example:
string1 --> string1 # added nothing
string2[] --> string2[extra1="1",extra2="2"] # added two items
string3[item="1"] --> string3[item="1",extra1="1",extra2="2"] # added two items
Currently I do this with the following Go code:
str1 := "test"
str2 := `test[]`
str3 := `test[item1="1"]`
re := regexp.MustCompile(`\[(.+)?\]`)
for _, s := range []string{str1, str2, str3} {
s = re.ReplaceAllString(s, fmt.Sprintf(`[item1="a",item2="b",$1]`))
fmt.Println(s)
}
But with empty brackets, the output also has an unwanted trailing comma:
test
test[item1="a",item2="b",]
test[item1="a",item2="b",item1="1"]
Is it possible to avoid inserting the comma when the brackets are empty?
Of course I could parse the string again and trim the comma, but that seems suboptimal.

You can use two regexes: one that matches empty brackets [] and one that matches brackets with text inside. The tested code is below:
https://play.golang.org/p/_DOOGDMUOCm
A second way is to check the string after the replacement: if it ends with ,] then cut it before the comma and append ]. I guess you already know this approach.
package main
import (
"fmt"
"regexp"
)
func main() {
str1 := "test"
str2 := `test[]`
str3 := `test[item1="1"]`
re := regexp.MustCompile(`\[(.*)\]`)
nonEmpty := regexp.MustCompile(`\[.+\]`) // brackets with at least one item
for _, s := range []string{str1, str2, str3} {
if nonEmpty.MatchString(s) {
// keep the existing items after the extras
s = re.ReplaceAllString(s, `[item1="a",item2="b",$1]`)
} else {
// empty brackets: no trailing comma
s = re.ReplaceAllString(s, `[item1="a",item2="b"]`)
}
fmt.Println(s)
}
}
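The same decision can be made inside a single pass with ReplaceAllStringFunc. This is only a sketch under a few assumptions: addExtras is a made-up name, the extras are appended after the existing items as in the examples in the question, and each string contains at most one bracket pair (the greedy .* would otherwise span from the first [ to the last ]).

```go
package main

import (
	"fmt"
	"regexp"
)

var brackets = regexp.MustCompile(`\[(.*)\]`)

// addExtras is a hypothetical helper: it appends extras to the bracketed item
// list in s and adds a separating comma only when the brackets are not empty.
func addExtras(s, extras string) string {
	return brackets.ReplaceAllStringFunc(s, func(m string) string {
		inner := m[1 : len(m)-1] // strip the surrounding [ and ]
		if inner == "" {
			return "[" + extras + "]"
		}
		return "[" + inner + "," + extras + "]"
	})
}

func main() {
	extras := `extra1="1",extra2="2"`
	for _, s := range []string{"string1", "string2[]", `string3[item="1"]`} {
		fmt.Println(addExtras(s, extras))
	}
	// prints:
	// string1
	// string2[extra1="1",extra2="2"]
	// string3[item="1",extra1="1",extra2="2"]
}
```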

Selecting text between borders using regexp in Go [duplicate]

content := `{null,"Age":24,"Balance":33.23}`
rule,_ := regexp.Compile(`"([^\"]+)"`)
results := rule.FindAllString(content,-1)
fmt.Println(results[0]) //"Age"
fmt.Println(results[1]) //"Balance"
This is a JSON-like string from a web API that contains a bare null entry, and I don't want to replace anything inside it.
I want to use a regex to match all the keys and get them without the double quotes: the output should be Age and Balance, not "Age" and "Balance".
How can I achieve this?

One solution is to use a regular expression that matches any characters between quotes (such as yours, or ".*?") and then either put a capturing group (a "submatch") inside the quotes and call FindAllStringSubmatch, or call FindAllString and take the relevant substring of each match.
For example:
package main
import (
"fmt"
"regexp"
)
func main() {
str := `{null,"Age":24,"Balance":33.23}`
fmt.Printf("OK1: %#v\n", getQuotedStrings1(str))
// OK1: []string{"Age", "Balance"}
fmt.Printf("OK2: %#v\n", getQuotedStrings2(str))
// OK2: []string{"Age", "Balance"}
}
var re1 = regexp.MustCompile(`"(.*?)"`) // Note the matching group (submatch).
func getQuotedStrings1(s string) []string {
ms := re1.FindAllStringSubmatch(s, -1)
ss := make([]string, len(ms))
for i, m := range ms {
ss[i] = m[1]
}
return ss
}
var re2 = regexp.MustCompile(`".*?"`)
func getQuotedStrings2(s string) []string {
ms := re2.FindAllString(s, -1)
ss := make([]string, len(ms))
for i, m := range ms {
ss[i] = m[1 : len(m)-1] // Note the substring of the match.
}
return ss
}
Note that the second version (without a submatching group) may be slightly faster based on a simple benchmark, if performance is critical.
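Both versions above return every quoted string, so they would also pick up string values if the input had any. If only the keys are wanted, one option is to require a colon after the closing quote. A minimal sketch, assuming keys contain no escaped quotes; keyRe, keys and the extra "Name" field are invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// keyRe matches a quoted string that is followed by a colon and captures
// the text between the quotes.
var keyRe = regexp.MustCompile(`"([^"]+)"\s*:`)

// keys is a hypothetical helper that returns the unquoted keys found in s.
func keys(s string) []string {
	var out []string
	for _, m := range keyRe.FindAllStringSubmatch(s, -1) {
		out = append(out, m[1]) // m[0] is the whole match, m[1] the key
	}
	return out
}

func main() {
	fmt.Println(keys(`{null,"Age":24,"Name":"Bob","Balance":33.23}`))
	// prints: [Age Name Balance]
}
```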

How remove email-address from string?

I have a string and I want to remove the e-mail address from it, if there is one.
For example:
This is some text and it continues like this
until sometimes an email
adress shows up asd#asd.com
also some more text here and here.
I want this as a result.
This is some text and it continues like this
until sometimes an email
adress shows up [email_removed]
also some more text here and here.
cleanFromEmail(string)
{
newWordString =
space := a_space
Needle = #
wordArray := StrSplit(string, [" ", "`n"])
Loop % wordArray.MaxIndex()
{
thisWord := wordArray[A_Index]
IfInString, thisWord, %Needle%
{
newWordString = %newWordString%%space%(email_removed)%space%
}
else
{
newWordString = %newWordString%%space%%thisWord%%space%
;msgbox asd
}
}
return newWordString
}
The problem is that I end up losing all the line breaks and get only spaces. How can I rebuild the string so that it looks just like it did before the e-mail address was removed?

That looks rather complicated. Why not use RegExReplace instead?
string =
(
This is some text and it continues like this
until sometimes an email adress shows up asd#asd.com
also some more text here and here.
)
newWordString := RegExReplace(string, "\S+#\S+(?:\.\S+)+", "[email_removed]")
MsgBox, % newWordString
Feel free to make the pattern as simple or as complicated as you want, depending on your needs, but RegExReplace should do it.
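For readers following the Go examples in the other questions here, the same pattern carries over to Go's regexp package unchanged. This is only an illustrative sketch: it keeps the # separator from the sample text, and the Go function cleanFromEmail merely mirrors the name used in the question. Because \S never matches a newline, the line breaks survive the replacement.

```go
package main

import (
	"fmt"
	"regexp"
)

// emailRe is the pattern from the RegExReplace call above; "#" stands in for
// the usual "@", as in the sample text.
var emailRe = regexp.MustCompile(`\S+#\S+(?:\.\S+)+`)

// cleanFromEmail replaces every address-like token with a placeholder and
// leaves all other characters, including newlines, untouched.
func cleanFromEmail(s string) string {
	return emailRe.ReplaceAllString(s, "[email_removed]")
}

func main() {
	text := "until sometimes an email\nadress shows up asd#asd.com\nalso some more text here and here."
	fmt.Println(cleanFromEmail(text))
	// prints:
	// until sometimes an email
	// adress shows up [email_removed]
	// also some more text here and here.
}
```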

If for some reason RegExReplace doesn't always work for you, you can try this:
text =
(
This is some text and it continues like this
until sometimes an email adress shows up asd#asd.com.
also some more text here and here.
)
MsgBox, % cleanFromEmail(text)
cleanFromEmail(string){
lineArray := StrSplit(string, "`n")
Loop % lineArray.MaxIndex()
{
newLine := ""
newWord := ""
thisLine := lineArray[A_Index]
If InStr(thisLine, "#")
{
wordArray := StrSplit(thisLine, " ")
Loop % wordArray.MaxIndex()
{
thisWord := wordArray[A_Index]
{
If InStr(thisWord, "#")
{
end := SubStr(thisWord, 0)
If end in ,,,.,;,?,!
newWord := "[email_removed]" end ""
else
newWord := "[email_removed]"
}
else
newWord := thisWord
}
newLine .= newWord . " " ; concatenate the outputs by adding a space to each one
}
newLine := trim(newLine) ; remove the last space from this variable
}
else
newLine := thisLine
newString .= newLine . "`n"
}
newString := trim(newString)
return newString
}

Golang regular expression for parsing key value pair into a string map

I'm looking to parse the following string into a map[string]string using a regular expression:
time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10
I'm trying to create a map that would have
m["time"] = "2017-05-30T19:02:08-05:00"
m["level"] = "info"
etc
I have tried regexp.FindAllStringIndex but can't come up with an appropriate regex. Is this the right way to go?

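This does not use a regex; it is an example of how to achieve the same result with strings.FieldsFunc. Note that the callback below keeps state between calls, while the FieldsFunc documentation makes no guarantee about the order in which it calls the function, so treat this as a sketch.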
https://play.golang.org/p/rr6U8xTJZT
package main
import (
"fmt"
"strings"
"unicode"
)
const foo = `time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10`
func main() {
lastQuote := rune(0)
f := func(c rune) bool {
switch {
case c == lastQuote:
lastQuote = rune(0)
return false
case lastQuote != rune(0):
return false
case unicode.In(c, unicode.Quotation_Mark):
lastQuote = c
return false
default:
return unicode.IsSpace(c)
}
}
// splitting string by space but considering quoted section
items := strings.FieldsFunc(foo, f)
// create and fill the map
m := make(map[string]string)
for _, item := range items {
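// split on the first "=" only, so values may themselves contain "="
x := strings.SplitN(item, "=", 2)
if len(x) != 2 {
continue // skip items without a value
}
m[x[0]] = strings.Trim(x[1], `"`) // drop the surrounding quotes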
}
// print the map
for k, v := range m {
fmt.Printf("%s: %s\n", k, v)
}
}

Instead of writing a regex of your own, you could simply use the github.com/kr/logfmt package, which implements decoding of logfmt key-value pairs.
Example logfmt message:
foo=bar a=14 baz="hello kitty" cool%story=bro f %^asdf
Example result in JSON:
{
"foo": "bar",
"a": 14,
"baz": "hello kitty",
"cool%story": "bro",
"f": true,
"%^asdf": true
}

Use named capturing groups in your regular expression together with the FindStringSubmatch and SubexpNames methods. E.g.:
s := `time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10`
re := regexp.MustCompile(`time="(?P<time>.*?)"\slevel=(?P<level>.*?)\s`)
values := re.FindStringSubmatch(s)
keys := re.SubexpNames()
// create map
d := make(map[string]string)
for i := 1; i < len(keys); i++ {
d[keys[i]] = values[i]
}
fmt.Println(d)
// OUTPUT: map[level:info time:2017-05-30T19:02:08-05:00]
// (since Go 1.12, fmt prints map keys in sorted order)
values is a slice containing all submatches. The first element is the text matched by the whole regexp, followed by one element for each capturing group.
You can wrap the code in a function if you need this more often (i.e. if you need something like Python's match.groupdict):
package main
import (
"fmt"
"regexp"
)
func groupmap(s string, r *regexp.Regexp) map[string]string {
values := r.FindStringSubmatch(s)
if values == nil {
return nil // no match: avoid indexing into a nil slice
}
keys := r.SubexpNames()
// create map
d := make(map[string]string)
for i := 1; i < len(keys); i++ {
d[keys[i]] = values[i]
}
return d
}
func main() {
s := `time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10`
re := regexp.MustCompile(`time="(?P<time>.*?)"\slevel=(?P<level>.*?)\s`)
fmt.Println(groupmap(s, re))
// OUTPUT: map[level:info time:2017-05-30T19:02:08-05:00]
}
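The named-group regexp above only covers the keys it spells out. If the keys are not known in advance, a generic pattern with FindAllStringSubmatch can collect every pair. This is a rough sketch under some assumptions: keys consist of \w characters, values are either double-quoted without embedded quotes or contain no whitespace, and pairRe and parsePairs are names invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// pairRe captures a key and a value that is either double-quoted or a run of
// non-space characters.
var pairRe = regexp.MustCompile(`(\w+)=("[^"]*"|\S+)`)

// parsePairs is a hypothetical helper that builds a map from every key=value
// pair found in s, stripping the quotes around quoted values.
func parsePairs(s string) map[string]string {
	m := make(map[string]string)
	for _, kv := range pairRe.FindAllStringSubmatch(s, -1) {
		m[kv[1]] = strings.Trim(kv[2], `"`)
	}
	return m
}

func main() {
	s := `time="2017-05-30T19:02:08-05:00" level=info msg="some log message" app=sample size=10`
	m := parsePairs(s)
	fmt.Println(m["time"])
	// prints: 2017-05-30T19:02:08-05:00
	fmt.Println(m["msg"])
	// prints: some log message
}
```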
