I am using Go's regexp package and want to call ReplaceAllStringFunc with an extra argument, not only with the matched string.
For example, I want to update this text
"<img src=\"/m/1.jpg\" /> <img src=\"/m/2.jpg\" /> <img src=\"/m/3.jpg\" />"
To (change "m" to "a" or anything else):
"<img src=\"/a/1.jpg\" /> <img src=\"/a/2.jpg\" /> <img src=\"/a/3.jpg\" />"
I would like to have something like:
func UpdateText(text string) string {
re, _ := regexp.Compile(`<img.*?src=\"(.*?)\"`)
text = re.ReplaceAllStringFunc(text, updateImgSrc)
return text
}
// update "/m/1.jpg" to "/a/1.jpg"
func updateImgSrc(imgSrcText, prefix string) string {
// replace "m" by prefix
return "<img src=\"" + newImgSrc + "\""
}
I checked the docs: ReplaceAllStringFunc doesn't accept extra arguments. What would be the best way to achieve my goal?
More generally, I would like to find all occurrences of a pattern and replace each one with a new string composed of the matched string plus a new parameter. Could anyone give me an idea?
I agree with the comments, you probably don't want to parse HTML with regular expressions (bad things will happen).
However, let's pretend it's not HTML, and you want to only replace submatches. You could do this
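func UpdateText(input string) (string, error) {
	re, err := regexp.Compile(`img.*?src=\"(.*?)\.(.*?)\"`)
	if err != nil {
		return "", err
	}
	// Build the output piece by piece so the result stays correct
	// even when a replacement changes the length of the text.
	var output strings.Builder
	last := 0
	for _, match := range re.FindAllStringSubmatchIndex(input, -1) {
		// match[2] and match[3] bound the first submatch (the path before the extension).
		imgStart, imgEnd := match[2], match[3]
		output.WriteString(input[last:imgStart])
		output.WriteString(strings.Replace(input[imgStart:imgEnd], "m", "a", -1))
		last = imgEnd
	}
	output.WriteString(input[last:])
	return output.String(), nil
}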
(Note that I've slightly changed your regular expression so that it matches the file extension separately.)
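If the goal is only to hand an extra argument to the callback, a closure may be enough: the function literal passed to ReplaceAllStringFunc can capture the parameter. This is a minimal sketch, not the answer's method; the two-argument UpdateText signature, the imgSrcRe variable, and the assumption that every src starts with "/m/" come from the question's sample input and are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// imgSrcRe matches the start of an img tag and captures the src value.
var imgSrcRe = regexp.MustCompile(`<img.*?src="(.*?)"`)

// updateImgSrc rewrites one matched tag prefix. It swaps a leading "/m/"
// in the src for "/<prefix>/" and leaves any other src untouched.
// The "/m/" layout is an assumption taken from the question's sample.
func updateImgSrc(imgTag, prefix string) string {
	sub := imgSrcRe.FindStringSubmatch(imgTag)
	if sub == nil || !strings.HasPrefix(sub[1], "/m/") {
		return imgTag
	}
	newSrc := "/" + prefix + "/" + strings.TrimPrefix(sub[1], "/m/")
	return strings.Replace(imgTag, sub[1], newSrc, 1)
}

// UpdateText passes prefix to the callback through a closure.
func UpdateText(text, prefix string) string {
	return imgSrcRe.ReplaceAllStringFunc(text, func(m string) string {
		return updateImgSrc(m, prefix)
	})
}

func main() {
	in := `<img src="/m/1.jpg" /> <img src="/m/2.jpg" />`
	fmt.Println(UpdateText(in, "a"))
}
```

The caveat about parsing HTML with regular expressions still applies; the closure only solves the "extra argument" part.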
Thanks to kostix's advice, here is my solution using an HTML parser.
func UpdateAllResourcePath(text, prefix string) (string, error) {
doc, err := goquery.NewDocumentFromReader(strings.NewReader(text))
if err != nil {
return "", err
}
sel := doc.Find("img")
length := len(sel.Nodes)
for index := 0; index < length; index++ {
imgSrc, ok := sel.Eq(index).Attr("src")
if !ok {
continue
}
newImgSrc, err := UpdateResourcePath(imgSrc, prefix) // change the imgsrc here
if err != nil {
return "", err
}
sel.Eq(index).SetAttr("src", newImgSrc)
}
newtext, err := doc.Find("body").Html()
if err != nil {
return "", err
}
return newtext, nil
}
I've been given a task to search for URLs in a text file using regex and goroutines with a WaitGroup, in the following way: the text should be divided between N workers (goroutines), each goroutine searches for https:// URLs, the goroutines are tracked in a WaitGroup, and the final result should be a slice of strings (URLs) collected from all goroutines.
I am working with a .txt file that holds a lot of content in a single string, including URLs.
Right now I know how to extract a slice of URLs from the text, but without dividing the text between goroutines...
import (
"fmt"
"os"
"regexp"
"sync"
"time"
)
func Parser1(wg *sync.WaitGroup) {
time.Sleep((1 * time.Second))
b, err := os.ReadFile("repitations")
if err != nil {
fmt.Print(err)
}
str := string(b)
re := regexp.MustCompile(`(?:https?://)?(?:[^/.]+\.)*google\.com(?:/[^/\s]+)*/?`)
fmt.Printf("%q\n", re.FindAllString(str, -1))
wg.Done()
}
func Parser2(wg *sync.WaitGroup) {
time.Sleep((1 * time.Second))
b, err := os.ReadFile("repitations")
if err != nil {
fmt.Print(err)
}
str := string(b)
re := regexp.MustCompile(`(?:https?://)?(?:[^/.]+\.)*google\.com(?:/[^/\s]+)*/?`)
fmt.Printf("%q\n", re.FindAllString(str, -1))
wg.Done()
}
func main() {
var wg sync.WaitGroup
wg.Add(2)
go Parser1(&wg)
go Parser2(&wg)
wg.Wait()
fmt.Println("Well done!")
}
Split your read process.
Open the file with os.Open() and read each section with file.ReadAt().
Pass the length to read and the offset from the start of the file to Parser().
func Parser(wg *sync.WaitGroup, f *os.File, length int64, offset int64) {
defer wg.Done()
content := make([]byte, length)
_, err := f.ReadAt(content, offset)
if err != nil {
log.Fatal(err)
}
log.Printf("%s", content)
....
}
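One way to finish the idea for text that is already in memory is sketched below. It is an illustration under assumptions, not the answer's code: the names FindURLs, splitOnWhitespace, and urlRe are hypothetical, and the URL pattern is a deliberately simple one. Each chunk boundary is pushed forward to the next whitespace so that no URL is cut in half, and each worker writes to its own slot so no mutex is needed.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// urlRe is a simple illustrative pattern, not a full URL grammar.
var urlRe = regexp.MustCompile(`https?://[^\s"'<>]+`)

// splitOnWhitespace cuts text into at most n chunks. Every cut is moved
// forward to the next whitespace byte so a URL never spans two chunks.
func splitOnWhitespace(text string, n int) []string {
	if n < 1 {
		n = 1
	}
	size := len(text)/n + 1
	var chunks []string
	for len(text) > 0 {
		end := size
		if end >= len(text) {
			chunks = append(chunks, text)
			break
		}
		if i := strings.IndexAny(text[end:], " \t\r\n"); i >= 0 {
			end += i
		} else {
			end = len(text)
		}
		chunks = append(chunks, text[:end])
		text = text[end:]
	}
	return chunks
}

// FindURLs runs one goroutine per chunk and merges the results in order.
func FindURLs(text string, workers int) []string {
	chunks := splitOnWhitespace(text, workers)
	results := make([][]string, len(chunks)) // one slot per worker
	var wg sync.WaitGroup
	for i, chunk := range chunks {
		wg.Add(1)
		go func(i int, chunk string) {
			defer wg.Done()
			results[i] = urlRe.FindAllString(chunk, -1)
		}(i, chunk)
	}
	wg.Wait()
	var urls []string
	for _, r := range results {
		urls = append(urls, r...)
	}
	return urls
}

func main() {
	text := "see https://a.example/x and http://b.example then https://c.example/y end"
	fmt.Printf("%q\n", FindURLs(text, 3))
}
```

The same boundary rule would be needed with ReadAt: a fixed byte offset can land in the middle of a URL, so each worker has to extend or trim its chunk to a separator.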
I have a file with a list of 600 regex patterns that must be applied in order to find a specific ID for a website.
Example:
regex/www\.effectiveperformanceformat\.com/5
regex/bam-cell\.nr-data\.net/5
regex/advgoogle\.com/5
regex/googleapi\.club/5
regex/doubleclickbygoogle\.com/5
regex/googlerank\.info/5
regex/google-pr7\.de/5
regex/usemarketings\.com/5
regex/google-rank\.org/5
regex/googleanalytcs\.com/5
regex/xml\.trafficmoose\.com/5
regex/265\.com/5
regex/app-measurement\.com/5
regex/loftsbaacad\.com/5
regex/toldmeflex\.com/5
regex/r\.baresi\.xyz/5
regex/molodgytot\.biz/5
regex/ec\.walkme\.com/5
regex/px\.ads\.linkedin\.com/5
regex/hinisanex\.biz/5
regex/buysellads\.com/5
regex/buysellads\.net/5
regex/servedby-buysellads\.com/5
regex/carbonads\.(net|com)/5
regex/oulddev\.biz/5
regex/click\.hoolig\.app/5
regex/engine\.blacraft\.com/5
regex/mc\.yandex\.ru/5
regex/ads\.gaming1\.com/5
regex/adform\.net/5
regex/luzulabeguile\.com/5
regex/ficanportio\.biz/5
regex/hidelen\.com/5
regex/earchmess\.fun/5
regex/acrvclk\.com/5
regex/track\.wg-aff\.com/5
regex/thumb\.tapecontent\.net/5
regex/betgorebysson\.club/5
regex/in-page-push\.com/5
regex/itphanpytor\.club/5
regex/mktoresp\.com/5
regex/xid\.i-mobile\.co\.jp/5
regex/ads\.tremorhub\.com/5
So far, what I'm using is something like this:
for _, line := range file {
l := line
data := strings.Split(l, "/")
if data[0] == "regex" {
match, _ := regexp.MatchString(``+data[1]+``, website)
if match {
id, _ = strconv.Atoi(data[2])
}
}
}
This works, but I wonder if there is a more optimized way to do it.
If the website matches a regex near the top of the list, great; if not, I have to keep iterating through the loop until I find the match.
Can anyone help me improve this?
Best regards
To reduce the time, you can cache the compiled regexps.
package main
import (
"bufio"
"bytes"
"fmt"
csvutils "github.com/alessiosavi/GoGPUtils/csv"
"log"
"os"
"regexp"
"strconv"
"strings"
"time"
)
func main() {
now := time.Now()
Precomputed("www.google.it")
fmt.Println(time.Since(now))
now = time.Now()
NonPrecomputed("www.google.it")
fmt.Println(time.Since(now))
}
func NonPrecomputed(website string) int {
for _, line := range cachedLines {
l := line
data := strings.Split(l, "/")
if data[0] == "regex" {
match, _ := regexp.MatchString(``+data[1]+``, website)
if match {
id, _ := strconv.Atoi(data[2])
return id
}
}
}
return -1
}
func Precomputed(site string) int {
for regex, id := range rawRegex {
if ok := regex.MatchString(site); ok {
return id
}
}
return -1
}
var rawRegex map[*regexp.Regexp]int = make(map[*regexp.Regexp]int)
var cachedLines []string
var sites []string
func init() {
now := time.Now()
file, err := os.ReadFile("regex.txt")
if err != nil {
panic(err)
}
scanner := bufio.NewScanner(bytes.NewReader(file))
for scanner.Scan() {
txt := scanner.Text()
cachedLines = append(cachedLines, txt)
split := strings.Split(txt, "/")
if len(split) == 3 {
compile, err := regexp.Compile(split[1])
if err != nil {
panic(err)
}
if rawRegex[compile], err = strconv.Atoi(split[2]); err != nil {
panic(err)
}
}
}
file, err = os.ReadFile("top500Domains.csv")
if err != nil {
panic(err)
}
_, csvData, err := csvutils.ReadCSV(file, ',')
if err != nil {
panic(err)
}
for _, line := range csvData {
sites = append(sites, line[1])
}
log.Println("Init took:", time.Since(now))
}
The init function takes care of the regexp cache. It loads all the compiled regexps into a map together with their IDs (it also loads the test data, just for the benchmark).
Then you have two functions:
Precomputed: uses the map of cached regexps
NonPrecomputed: a copy-paste of your snippet
As the benchmark below shows, with -benchtime=10s NonPrecomputed completes 63 executions (about 298 ms each), while Precomputed completes 10000 (about 1.1 ms each).
NonPrecomputed also allocates about 66 MB per execution, whereas Precomputed allocates nothing (thanks to the initial cache).
C:\opt\SP\Workspace\Go\Temp>go test -bench=. -benchmem -benchtime=10s
2022/11/03 00:45:35 Init took: 10.8397ms
goos: windows
goarch: amd64
pkg: Temp
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Benchmark_Precomputed-8 10000 1113887 ns/op 0 B/op 0 allocs/op
Benchmark_NonPrecomputed-8 63 298434740 ns/op 65782238 B/op 484595 allocs/op
PASS
ok Temp 41.548s
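One caveat if the patterns really must be tried in file order: Go does not specify the iteration order of a map, so a map keyed by *regexp.Regexp may test the patterns in a different order on each run. A slice keeps the order and still compiles each pattern only once. The sketch below assumes the "regex/<pattern>/<id>" line format from the question; the rule type and the parseRules and lookup names are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// rule pairs a compiled pattern with the ID it maps to.
type rule struct {
	re *regexp.Regexp
	id int
}

// parseRules compiles every "regex/<pattern>/<id>" line once and keeps
// the rules in the order they appear. Other lines are skipped.
func parseRules(src string) ([]rule, error) {
	var rules []rule
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		parts := strings.Split(sc.Text(), "/")
		if len(parts) != 3 || parts[0] != "regex" {
			continue
		}
		re, err := regexp.Compile(parts[1])
		if err != nil {
			return nil, err
		}
		id, err := strconv.Atoi(parts[2])
		if err != nil {
			return nil, err
		}
		rules = append(rules, rule{re: re, id: id})
	}
	return rules, sc.Err()
}

// lookup returns the ID of the first rule that matches, or -1.
func lookup(rules []rule, website string) int {
	for _, r := range rules {
		if r.re.MatchString(website) {
			return r.id
		}
	}
	return -1
}

func main() {
	rules, err := parseRules("regex/advgoogle\\.com/5\nregex/mc\\.yandex\\.ru/5\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(lookup(rules, "mc.yandex.ru"), lookup(rules, "example.org"))
}
```

As in the original snippet, splitting on "/" means a pattern that itself contains a slash is skipped; that limitation is kept here on purpose.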
I am trying to find matching filesystem objects using Go and determine the type of path I received as input. Specifically, I need to perform actions on the object(s) if they match the path provided. Example input for the path could look like this:
/path/to/filename.ext
/path/to/dirname
/path/to/*.txt
I need to know if the path exists, is a file or a directory or a regex so I can process the input accordingly. Here's the solution I've devised so far:
func getPathType(path string) (bool, string, error) {
cpath := filepath.Clean(path)
l, err := filepath.Glob(cpath)
if err != nil {
return false, "", err
}
switch len(l) {
case 0:
return false, "", nil
case 1:
fsstat, fserr := os.Stat(cpath)
if fserr != nil {
return false, "", fserr
}
if fsstat.IsDir() {
return true, "dir", nil
}
return true, "file", nil
default:
return false, "regex", nil
}
}
I realize that the above code would allow a regex that returned a single value to be interpreted as a dir or file and not as a regex. For my purposes, I can let that slide but just curious if anyone has developed a better way of taking a path potentially containing regex as input and determining whether or not the last element is a regex or not.
Test for glob special characters to determine if the path is a glob pattern. Use filepath.Match to check for valid glob pattern syntax.
func getPathType(path string) (bool, string, error) {
cpath := filepath.Clean(path)
// Use Match to check glob syntax.
if _, err := filepath.Match(cpath, ""); err != nil {
return false, "", err
}
// If syntax is good and the path includes special
// glob characters, then it's a glob pattern.
special := `*?[`
if runtime.GOOS != "windows" {
special = `*?[\`
}
if strings.ContainsAny(cpath, special) {
return false, "regex", nil
}
fsstat, err := os.Stat(cpath)
if os.IsNotExist(err) {
return false, "", nil
} else if err != nil {
return false, "", err
}
if fsstat.IsDir() {
return true, "dir", nil
}
return true, "file", nil
}
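The pattern check on its own can be exercised without touching the filesystem. The sketch below isolates it in a hypothetical helper, isGlob, and runs it on the three sample inputs from the question. It leaves out the backslash handling for non-Windows systems shown above, and it assumes a Go version in which filepath.Match validates the whole pattern (Go 1.16 or later).

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isGlob reports whether p contains glob metacharacters. A malformed
// pattern is returned as an error (filepath.ErrBadPattern).
func isGlob(p string) (bool, error) {
	// Matching against the empty name is only a syntax check here.
	if _, err := filepath.Match(p, ""); err != nil {
		return false, err
	}
	return strings.ContainsAny(p, `*?[`), nil
}

func main() {
	for _, p := range []string{"/path/to/filename.ext", "/path/to/dirname", "/path/to/*.txt"} {
		ok, err := isGlob(p)
		fmt.Println(p, ok, err)
	}
}
```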
I have the following IAM Policy:
{"Version":"2012-10-17","Statement":[{"Sid":"","Effect":"Allow","Principal":{"AWS":"arn:aws:sts::<account>:assumed-role/custom_role/<role>"},"Action":"sts:AssumeRole","Condition":{"StringEquals":{"sts:ExternalId":"<account>"}}}]}
but the "AWS" portion can also be an array:
"AWS": [
"arn:aws:sts::<account>:assumed-role/custom_role/<role_1>",
"arn:aws:sts::<account>:assumed-role/custom_role/<role_2>"
]
What I need is a regex that can parse both structures and return the list of arn:aws:sts as a list of strings... how can I accomplish that using regex in Golang?
I tried to use json.Unmarshal but the object structure is different between []string and string
Edit:
I have the following snippet:
re := regexp.MustCompile(`arn:aws:sts::[a-z0-9]*:assumed-role/custom_role/[a-z0-9]-*`)
result := re.FindAll([]byte(arn), 10)
for _, res := range result {
fmt.Println(string(res))
}
>>> `arn:aws:sts::<account_id>:assumed-role/custom_role/`
Using JSON decoder
You can decode the AWS key directly into a custom type implementing the "json.Unmarshaler" interface and decode both inputs correctly.
type AWSRoles []string
func (r *AWSRoles) UnmarshalJSON(b []byte) error {
var s string
if err := json.Unmarshal(b, &s); err == nil {
*r = append(*r, s)
return nil
}
var ss []string
if err := json.Unmarshal(b, &ss); err == nil {
*r = ss
return nil
}
return errors.New("cannot unmarshal into either a string or a slice of strings")
}
type AWSPolicy struct {
Statement []struct {
Principal struct {
AWSRoles AWSRoles `json:"AWS"`
} `json:"Principal"`
} `json:"Statement"`
}
Here's a test for it
var testsAWSPolicyParsing = []struct {
name string
input []byte
wantRoles []string
}{
{
name: "unique role",
input: []byte(`{"Version":"2012-10-17","Statement":[{"Sid":"","Effect":"Allow","Principal":{"AWS":"arn:aws:sts::<account>:assumed-role/custom_role/<role>"},"Action":"sts:AssumeRole","Condition":{"StringEquals":{"sts:ExternalId":"<account>"}}}]}`),
wantRoles: []string{"arn:aws:sts::<account>:assumed-role/custom_role/<role>"},
},
{
name: "multiple roles",
input: []byte(`{"Version":"2012-10-17","Statement":[{"Sid":"","Effect":"Allow","Principal":{"AWS":["arn:aws:sts::<account>:assumed-role/custom_role/<role_1>","arn:aws:sts::<account>:assumed-role/custom_role/<role_2>"]},"Action":"sts:AssumeRole","Condition":{"StringEquals":{"sts:ExternalId":"<account>"}}}]}`),
wantRoles: []string{
"arn:aws:sts::<account>:assumed-role/custom_role/<role_1>",
"arn:aws:sts::<account>:assumed-role/custom_role/<role_2>",
},
},
}
func TestParseAWSPolicy(t *testing.T) {
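for _, tc := range testsAWSPolicyParsing {
tc := tc // copy the range variable: before Go 1.22, parallel subtests would otherwise all see the last case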
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
var p AWSPolicy
err := json.Unmarshal(tc.input, &p)
if err != nil {
t.Fatal("unexpected error parsing AWSRoles policy", err)
}
if l := len(p.Statement); l != 1 {
t.Fatalf("unexpected Statement length. want 1, got %d", l)
}
if got := p.Statement[0].Principal.AWSRoles; !reflect.DeepEqual(got, tc.wantRoles) {
t.Fatalf("roles are not the same, got %v, want %v", got, tc.wantRoles)
}
})
}
}
Using a Regex
If you still want to use a regex, this one would parse it as long as:
the AWS account ID contains only digits and lowercase letters, [a-z0-9] (real account IDs are all digits)
the custom role name has only alphanumeric characters and underscores
var awsRolesRegex = regexp.MustCompile("arn:aws:sts::[a-z0-9]+:assumed-role/custom_role/[a-zA-Z0-9_]+")
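A short usage sketch for this regex follows. The account number and role names in the sample policy are made up for illustration, and extractRoles is a hypothetical helper name. Because FindAllString scans the raw text, it returns the ARNs whether "AWS" holds a single string or an array.

```go
package main

import (
	"fmt"
	"regexp"
)

var awsRolesRegex = regexp.MustCompile("arn:aws:sts::[a-z0-9]+:assumed-role/custom_role/[a-zA-Z0-9_]+")

// extractRoles returns every ARN in the policy text that matches the regex.
func extractRoles(policy string) []string {
	return awsRolesRegex.FindAllString(policy, -1)
}

func main() {
	// Sample values are invented for this example.
	policy := `{"Principal":{"AWS":["arn:aws:sts::123456789012:assumed-role/custom_role/role_1","arn:aws:sts::123456789012:assumed-role/custom_role/role_2"]}}`
	for _, arn := range extractRoles(policy) {
		fmt.Println(arn)
	}
}
```

Note that the placeholder text from the question (`<account>`, `<role>`) does not match this pattern, since angle brackets are outside both character classes.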
I have a simple program in Go to aid in learning regular expressions. It runs in an infinite loop and has 2 channels, one which is used to provide input (input contains regex pattern and subject), and the second one, which provides the output.
usage: main.exe (cat)+ catcatdog
However, there is probably something wrong in the code, as I can't seem to get any results with the $ anchor.
For example, I expect "cat" as output from
main.exe cat$ cat\ndog
yet I receive zero results.
Code:
package main
import (
"fmt"
"regexp"
"bufio"
"os"
"strings"
)
type RegexRequest struct {
regex string
subject string
}
func main() {
regexRequests := make(chan *RegexRequest)
defer close(regexRequests)
regexAnswers, err := createResolver(regexRequests)
defer close(regexAnswers)
if(err != nil) { // TODO: Panics when exited via ctrl+c
panic(err)
}
interact(regexRequests, regexAnswers)
}
func interact(regexRequests chan *RegexRequest, regexAnswers chan []string) {
for {
fmt.Println("Enter regex and subject: ")
reader := bufio.NewReader(os.Stdin)
line, err := reader.ReadString('\n')
if(err != nil) {
panic(err)
}
regAndString := strings.SplitN(line, " ", 2);
if len(regAndString) != 2 {
fmt.Println("Invalid input, expected [regex][space][subject]")
continue
}
regexRequests <- &RegexRequest{ regAndString[0], regAndString[1] }
result := <- regexAnswers
var filteredResult []string
for _, element := range result {
if(element != "") {
filteredResult = append(filteredResult, element)
} else {
filteredResult = append(filteredResult, "EMPTY");
}
}
fmt.Println(strings.Join(filteredResult, " "))
}
}
func createResolver(inputChan chan *RegexRequest)(outputChan chan []string, err error) {
if(cap(inputChan) > 0) {
return nil, fmt.Errorf("Expected an unbuffered channel")
}
outputChan = make(chan []string)
err = nil
go func() {
for {
var regReq *RegexRequest= (<- inputChan);
var regex *regexp.Regexp = regexp.MustCompile(regReq.regex)
outputChan <- regex.FindAllString(regReq.subject, -1)
}
}()
return
}
Check your regex pattern. For example,
Enter regex and subject:
cat$ cat\ndog
Enter regex and subject:
^cat cat\ndog
cat
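A likely explanation, assuming the program receives the line exactly as typed: reader.ReadString('\n') keeps the trailing newline, the characters \n typed at the prompt are a literal backslash followed by n rather than a line break, and in Go's regexp syntax $ matches only at the end of the text unless the m flag is set. The sketch below shows the difference; findAll is a hypothetical helper, and converting a typed \n into a real newline is an assumption about what the input was meant to express.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// findAll prepares the raw input line and returns all matches.
// It drops the line break that ReadString leaves at the end and turns
// a typed backslash-n into a real newline (an assumption for this demo).
func findAll(pattern, rawSubject string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	subject := strings.TrimRight(rawSubject, "\r\n")
	subject = strings.ReplaceAll(subject, `\n`, "\n")
	return re.FindAllString(subject, -1), nil
}

func main() {
	raw := "cat\\ndog\n" // what ReadString returns for the typed line: cat\ndog

	plain, _ := findAll(`cat$`, raw)
	multi, _ := findAll(`(?m)cat$`, raw)

	fmt.Println(len(plain)) // no match: "cat" is not at the end of the text
	fmt.Println(multi)      // the m flag lets $ match at the end of each line
}
```

Compiling with regexp.Compile instead of regexp.MustCompile also lets the loop report a bad pattern instead of panicking.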