Go - regex inside loop - regex

I have a file with a list of 600 regex patterns that most be performed in order to find a specific id for a website.
Example:
regex/www\.effectiveperformanceformat\.com/5
regex/bam-cell\.nr-data\.net/5
regex/advgoogle\.com/5
regex/googleapi\.club/5
regex/doubleclickbygoogle\.com/5
regex/googlerank\.info/5
regex/google-pr7\.de/5
regex/usemarketings\.com/5
regex/google-rank\.org/5
regex/googleanalytcs\.com/5
regex/xml\.trafficmoose\.com/5
regex/265\.com/5
regex/app-measurement\.com/5
regex/loftsbaacad\.com/5
regex/toldmeflex\.com/5
regex/r\.baresi\.xyz/5
regex/molodgytot\.biz/5
regex/ec\.walkme\.com/5
regex/px\.ads\.linkedin\.com/5
regex/hinisanex\.biz/5
regex/buysellads\.com/5
regex/buysellads\.net/5
regex/servedby-buysellads\.com/5
regex/carbonads\.(net|com)/5
regex/oulddev\.biz/5
regex/click\.hoolig\.app/5
regex/engine\.blacraft\.com/5
regex/mc\.yandex\.ru/5
regex/ads\.gaming1\.com/5
regex/adform\.net/5
regex/luzulabeguile\.com/5
regex/ficanportio\.biz/5
regex/hidelen\.com/5
regex/earchmess\.fun/5
regex/acrvclk\.com/5
regex/track\.wg-aff\.com/5
regex/thumb\.tapecontent\.net/5
regex/betgorebysson\.club/5
regex/in-page-push\.com/5
regex/itphanpytor\.club/5
regex/mktoresp\.com/5
regex/xid\.i-mobile\.co\.jp/5
regex/ads\.tremorhub\.com/5
So far what i'm using is something like this
for _, line := range file {
l := line
data := strings.Split(l, "/")
if data[0] == "regex" {
match, _ := regexp.MatchString(``+data[1]+``, website)
if match {
id, _ = strconv.Atoi(data[2])
}
}
}
This is working, but i wonder if there is a more optimized way to do this.
Because, if the website match with the regex on the top, great, but if not, i need to intenered the loop over and over till find it.
Anyone can help me to improve this?
Best regards

In order to reduce the time you can cache the regexp.
package main
import (
"bufio"
"bytes"
"fmt"
csvutils "github.com/alessiosavi/GoGPUtils/csv"
"log"
"os"
"regexp"
"strconv"
"strings"
"time"
)
func main() {
now := time.Now()
Precomputed("www.google.it")
fmt.Println(time.Since(now))
now = time.Now()
NonPrecomputed("www.google.it")
fmt.Println(time.Since(now))
}
func NonPrecomputed(website string) int {
for _, line := range cachedLines {
l := line
data := strings.Split(l, "/")
if data[0] == "regex" {
match, _ := regexp.MatchString(``+data[1]+``, website)
if match {
id, _ := strconv.Atoi(data[2])
return id
}
}
}
return -1
}
func Precomputed(site string) int {
for regex, id := range rawRegex {
if ok := regex.MatchString(site); ok {
return id
}
}
return -1
}
var rawRegex map[*regexp.Regexp]int = make(map[*regexp.Regexp]int)
var cachedLines []string
var sites []string
func init() {
now := time.Now()
file, err := os.ReadFile("regex.txt")
if err != nil {
panic(err)
}
scanner := bufio.NewScanner(bytes.NewReader(file))
for scanner.Scan() {
txt := scanner.Text()
cachedLines = append(cachedLines, txt)
split := strings.Split(txt, "/")
if len(split) == 3 {
compile, err := regexp.Compile(split[1])
if err != nil {
panic(err)
}
if rawRegex[compile], err = strconv.Atoi(split[2]); err != nil {
panic(err)
}
}
}
file, err = os.ReadFile("top500Domains.csv")
if err != nil {
panic(err)
}
_, csvData, err := csvutils.ReadCSV(file, ',')
if err != nil {
panic(err)
}
for _, line := range csvData {
sites = append(sites, line[1])
}
log.Println("Init took:", time.Since(now))
}
The init method take care of regexp cache. It will load all the regexp in a map with the relative index (it will load the test data too just for the benchmark).
Then you have 2 method:
Precomputed: use the map of cached regexp
NonPrecomputed: the copy->paste of your snippet
As you can see where the NonPrecomputed method is able to perform 63 execution, the Precomputed is able to perform 10000 execution.
As you can see the NonPrecomputed method allocate ~67 MB when the Precomputed method have no allocation (due to the initial cache)
C:\opt\SP\Workspace\Go\Temp>go test -bench=. -benchmem -benchtime=10s
2022/11/03 00:45:35 Init took: 10.8397ms
goos: windows
goarch: amd64
pkg: Temp
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 # 3.00GHz
Benchmark_Precomputed-8 10000 1113887 ns/op 0 B/op 0 allocs/op
Benchmark_NonPrecomputed-8 63 298434740 ns/op 65782238 B/op 484595 allocs/op
PASS
ok Temp 41.548s

Related

How to use go-routines while parsing text file for URLs with regex

I've been given a task to search for URLs in text file useng regex and goroutines with waitgroup in the way the given way: text should be devided between N workers (goroutines), each goroutine search for //https://, goroutines in waitgroup, final result should be a slice of strings (URLs) from all goroutines together.
Iam wotking with a txt.file with dozens of stuff in a single string but including URLs
right for now i know how to extract a slice of URLs from the text but without deviding a text and goroutines...
import (
"fmt"
"os"
"regexp"
"sync"
"time"
)
func Parser1(wg *sync.WaitGroup) {
time.Sleep((1 * time.Second))
b, err := os.ReadFile("repitations")
if err != nil {
fmt.Print(err)
}
str := string(b)
re := regexp.MustCompile(`(?:https?://)?(?:[^/.]+\.)*google\.com(?:/[^/\s]+)*/?`)
fmt.Printf("%q\n", re.FindAllString(str, -1))
wg.Done()
}
func Parser2(wg *sync.WaitGroup) {
time.Sleep((1 * time.Second))
b, err := os.ReadFile("repitations")
if err != nil {
fmt.Print(err)
}
str := string(b)
re := regexp.MustCompile(`(?:https?://)?(?:[^/.]+\.)*google\.com(?:/[^/\s]+)*/?`)
fmt.Printf("%q\n", re.FindAllString(str, -1))
wg.Done()
}
func main() {
var wg sync.WaitGroup
wg.Add(2)
go Parser1(&wg)
go Parser2(&wg)
wg.Wait()
fmt.Println("Well done!")
}````
Split your read process.
Open file with os.Open() and read sequentially with file.ReadAt().
Pass length to read and offset from start to Parser()
func Parser(wg *sync.WaitGroup, f *os.File, length int64, offset int64) {
defer wg.Done()
content := make([]byte, length)
_, err := f.ReadAt(content, offset)
if err != nil {
log.Fatal(err)
}
log.Printf("%s", content)
....
}

Can you mock the page values in an AWS API Paginator/paginated/Pages call?

Is there a way to return test page values returned from the AWS API paginators to test the code below? If not, I suppose it's better to split the tag checking into a function that can be tested in isolation?
Note: This is just an example, I realize there are input Filters on the I can apply to the API call to achieve the same thing demonstrated here.
package main
import (
"fmt"
"github.com/aws/aws-sdk-go/aws"
"github.com/aws/aws-sdk-go/aws/session"
"github.com/aws/aws-sdk-go/service/ec2"
"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)
type handler struct {
EC2 ec2iface.EC2API
}
func main() {
sess := session.New()
client := ec2.New(sess)
h := &handler{EC2: client}
tagged, err := h.findTagged()
if err != nil {
panic(err)
}
fmt.Println(tagged)
}
func (h *handler) findTagged() ([]string, error) {
defaults := []string{}
input := &ec2.DescribeVpcsInput{}
err := h.EC2.DescribeVpcsPages(input, func(page *ec2.DescribeVpcsOutput, lastPage bool) bool {
for _, p := range page.Vpcs {
for _, t := range p.Tags {
if aws.StringValue(t.Key) == "test" {
defaults = append(defaults, aws.StringValue(p.VpcId))
}
}
}
return false
})
return defaults, err
}
This is described on the official documentation (Unit Testing with the AWS SDK for Go V2 - How to mock the AWS SDK for Go V2 when unit testing your application
Extract from the page:
import "context"
import "fmt"
import "testing"
import "github.com/aws/aws-sdk-go-v2/service/s3"
// ...
type mockListObjectsV2Pager struct {
PageNum int
Pages []*s3.ListObjectsV2Output
}
func (m *mockListObjectsV2Pager) HasMorePages() bool {
return m.PageNum < len(m.Pages)
}
func (m *mockListObjectsV2Pager) NextPage(ctx context.Context, f ...func(*s3.Options)) (output *s3.ListObjectsV2Output, err error) {
if m.PageNum >= len(m.Pages) {
return nil, fmt.Errorf("no more pages")
}
output = m.Pages[m.PageNum]
m.PageNum++
return output, nil
}
func TestCountObjects(t *testing.T) {
pager := &mockListObjectsV2Pager{
Pages: []*s3.ListObjectsV2Output{
{
KeyCount: 5,
},
{
KeyCount: 10,
},
{
KeyCount: 15,
},
},
}
objects, err := CountObjects(context.TODO(), pager)
if err != nil {
t.Fatalf("expect no error, got %v", err)
}
if expect, actual := 30, objects; expect != actual {
t.Errorf("expect %v, got %v", expect, actual)
}
}

DynamoDB list all backups using AWS GoLang SDK

Based on the example given in the link blow on API Operation Pagination without Callbacks
https://aws.amazon.com/blogs/developer/context-pattern-added-to-the-aws-sdk-for-go/
I am trying to list all the Backups in dynamodb. But it seems like pagination is not working and it is just retrieving first page and not going to next page
package main
import (
"context"
"fmt"
"github.com/aws/aws-sdk-go/aws"
"github.com/aws/aws-sdk-go/aws/request"
"github.com/aws/aws-sdk-go/aws/session"
"github.com/aws/aws-sdk-go/service/dynamodb"
)
func main() {
sess, sessErr := session.NewSession()
if sessErr != nil {
fmt.Println(sessErr)
fmt.Println("Cound not initilize session..returning..")
return
}
// Create DynamoDB client
dynamodbSvc := dynamodb.New(sess)
params := dynamodb.ListBackupsInput{}
ctx := context.Background()
p := request.Pagination{
NewRequest: func() (*request.Request, error) {
req, _ := dynamodbSvc.ListBackupsRequest(&params)
req.SetContext(ctx)
return req, nil
},
}
for p.Next() {
page := p.Page().(*dynamodb.ListBackupsOutput)
fmt.Println("Received", len(page.BackupSummaries), "objects in page")
for _, obj := range page.BackupSummaries {
fmt.Println(aws.StringValue(obj.BackupName))
}
}
//return p.Err()
} //end of main
Its a bit late but I'll just put it here in case I can help somebody.
Example:
var exclusiveStartARN *string
var backups []*dynamodb.BackupSummary
for {
backup, err := svc.ListBackups(&dynamodb.ListBackupsInput{
ExclusiveStartBackupArn:exclusiveStartARN,
})
if err != nil {
fmt.Println(err)
os.Exit(1)
}
backups = append(backups, backup.BackupSummaries...)
if backup.LastEvaluatedBackupArn != nil {
exclusiveStartARN = backup.LastEvaluatedBackupArn
//max 5 times a second so we dont hit the limit
time.Sleep(200 * time.Millisecond)
continue
}
break
}
fmt.Println(len(backups))
Explaination:
The way that pagination is done is via ExclusiveStartBackupArn in the ListBackupsRequest. The ListBackupsResponse returns LastEvaluatedBackupArn if there are more pages, or nil if its the last/only page.
It could be that you're smashing into the API a bit with your usage
You can call ListBackups a maximum of 5 times per second.
What is the value of p.HasNextPage() in your p.Next() loop?

how can I reduce cpu usage in a golang tcp server?

I try to implement a golang tcp server, and I found the concurrency is satisfied for me, but the CPU usage is too high(concurrency is 15W+/s, but the CPU usage is about 800% in a 24 cores linux machine). At the same time, a C++ tcp server is only about 200% usage with a similar concurrency(with libevent).
The following code is the demo of golang:
func main() {
listen, err := net.Listen("tcp", "0.0.0.0:17379")
if err != nil {
fmt.Errorf(err.Error())
}
go acceptClient(listen)
var channel2 = make(chan bool)
<-channel2
}
func acceptClient(listen net.Listener) {
for {
sock, err := listen.Accept()
if err != nil {
fmt.Errorf(err.Error())
}
tcp := sock.(*net.TCPConn)
tcp.SetNoDelay(true)
var channel = make(chan bool, 10)
go read(channel, sock.(*net.TCPConn))
go write(channel, sock.(*net.TCPConn))
}
}
func read(channel chan bool, sock *net.TCPConn) {
count := 0
for {
var buf = make([]byte, 1024)
n, err := sock.Read(buf)
if err != nil {
close(channel)
sock.CloseRead()
return
}
count += n
x := count / 58
count = count % 58
for i := 0; i < x; i++ {
channel <- true
}
}
}
func write(channel chan bool, sock *net.TCPConn) {
buf := []byte("+OK\r\n")
defer func() {
sock.CloseWrite()
recover()
}()
for {
_, ok := <-channel
if !ok {
return
}
_, writeError := sock.Write(buf)
if writeError != nil {
return
}
}
}
And I test this tcp server by the redis-benchmark with multi-clients:
redis-benchmark -h 10.100.45.2 -p 17379 -n 1000 -q script load "redis.call('set','aaa','aaa')"
I also analyzed my golang code by the pprof, it is said CPU cost a lot of time on syscall:
enter image description here
I don't think parallelise the read and write with channel will provide you better performance in this case. You should try to do less memory allocation and less syscall (The write function may do a lot of syscalls)
Can you try this version?
package main
import (
"bytes"
"fmt"
"net"
)
func main() {
listen, err := net.Listen("tcp", "0.0.0.0:17379")
if err != nil {
fmt.Errorf(err.Error())
}
acceptClient(listen)
}
func acceptClient(listen net.Listener) {
for {
sock, err := listen.Accept()
if err != nil {
fmt.Errorf(err.Error())
}
tcp := sock.(*net.TCPConn)
tcp.SetNoDelay(true)
go handleConn(tcp) // less go routine creation but no concurrent read/write on the same conn
}
}
var respPattern = []byte("+OK\r\n")
// just one goroutine per conn
func handleConn(sock *net.TCPConn) {
count := 0
buf := make([]byte, 4098) // Do not create a new buffer each time & increase the buff size
defer sock.Close()
for {
n, err := sock.Read(buf)
if err != nil {
return
}
count += n
x := count / 58
count = count % 58
resp := bytes.Repeat(respPattern, x) // can be optimize
_, writeError := sock.Write(resp) // do less syscall
if writeError != nil {
return
}
}
}

Regex end modifier not returning results in a Go program

I have a simple program in Go to aid in learning regular expressions. It runs in an infinite loop and has 2 channels, one which is used to provide input (input contains regex pattern and subject), and the second one, which provides the output.
usage: main.exe (cat)+ catcatdog
However there is propably something wrong in the code, as i can't seem to get any results with the $ modifier.
For example, i expect "cat" output from
main.exe cat$ cat\ndog
yet receive zero results.
Code:
package main
import (
"fmt"
"regexp"
"bufio"
"os"
"strings"
)
type RegexRequest struct {
regex string
subject string
}
func main() {
regexRequests := make(chan *RegexRequest)
defer close(regexRequests)
regexAnswers, err := createResolver(regexRequests)
defer close(regexAnswers)
if(err != nil) { // TODO: Panics when exited via ctrl+c
panic(err)
}
interact(regexRequests, regexAnswers)
}
func interact(regexRequests chan *RegexRequest, regexAnswers chan []string) {
for {
fmt.Println("Enter regex and subject: ")
reader := bufio.NewReader(os.Stdin)
line, err := reader.ReadString('\n')
if(err != nil) {
panic(err)
}
regAndString := strings.SplitN(line, " ", 2);
if len(regAndString) != 2 {
fmt.Println("Invalid input, expected [regex][space][subject]")
continue
}
regexRequests <- &RegexRequest{ regAndString[0], regAndString[1] }
result := <- regexAnswers
var filteredResult []string
for _, element := range result {
if(element != "") {
filteredResult = append(filteredResult, element)
} else {
filteredResult = append(filteredResult, "EMPTY");
}
}
fmt.Println(strings.Join(filteredResult, " "))
}
}
func createResolver(inputChan chan *RegexRequest)(outputChan chan []string, err error) {
if(cap(inputChan) > 0) {
return nil, fmt.Errorf("Expected an unbuffered channel")
}
outputChan = make(chan []string)
err = nil
go func() {
for {
var regReq *RegexRequest= (<- inputChan);
var regex *regexp.Regexp = regexp.MustCompile(regReq.regex)
outputChan <- regex.FindAllString(regReq.subject, -1)
}
}()
return
}
Check your regex pattern. For example,
Enter regex and subject:
cat$ cat\ndog
Enter regex and subject:
^cat cat\ndog
cat