How to optimize a CSV loader with pprof? - regex

I am trying to optimize a CSV loading process that essentially runs a regex search over a large CSV file (over 4 GB, 31033993 records in my experiment).
I managed to build multiprocessing logic to read the CSV, but when I analyze the CPU profile with pprof, I think my regex search is not optimized. Could you help me improve this code so that it reads the CSV much more quickly?
Here is my code so far:
package main
import (
"bufio"
"flag"
"fmt"
"log"
"os"
"regexp"
"runtime"
"runtime/pprof"
"strings"
"sync"
)
func processFile(path string) [][]string {
file, err := os.Open(path)
if err != nil {
log.Println("Error:", err)
}
var pattern = regexp.MustCompile(`^.*foo.*$`)
numCPU := runtime.NumCPU()
jobs := make(chan string, numCPU+1)
fmt.Printf("Strategy: Parallel, %d Workers ...\n", numCPU)
results := make(chan []string)
wg := new(sync.WaitGroup)
for w := 1; w <= numCPU; w++ {
wg.Add(1)
go parseRecord(jobs, results, wg, pattern)
}
go func() {
scanner := bufio.NewScanner(file)
for scanner.Scan() {
jobs <- scanner.Text()
}
close(jobs)
}()
go func() {
wg.Wait()
close(results)
}()
lines := [][]string{}
for line := range results {
lines = append(lines, line)
}
return lines
}
func parseRecord(jobs <-chan string, results chan<- []string, wg *sync.WaitGroup, pattern *regexp.Regexp) {
defer wg.Done()
for j := range jobs {
if pattern.MatchString(j) {
x := strings.Split(string(j), "\n")
results <- x
}
}
}
func split(r rune) bool {
return r == ','
}
func main() {
f, err := os.Create("perf.data")
if err != nil {
log.Fatal(err)
}
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
pathFlag := flag.String("file", "", `The CSV file to operate on.`)
flag.Parse()
lines := processFile(*pathFlag)
fmt.Println("loaded", len(lines), "records")
}
When I process the file without any regex constraint I am getting a reasonable computing time (I simply load the parsed string into the 2D array without any pattern.MatchString())
Strategy: Parallel, 8 Workers ...
loaded 31033993 records
2018/10/09 11:46:38 readLines took 30.611246035s
Instead, when I run the above code with the Regex constraint I am getting this result:
Strategy: Parallel, 8 Workers ...
loaded 143090 records
2018/10/09 12:04:32 readLines took 1m24.029830907s

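MatchString reports whether the string contains any match of the pattern, so the pattern does not have to cover the whole line.
That means you can drop the anchors and the wildcards: for single-line input, foo matches the same lines as ^.*foo.*$.
Wildcards at both ends of a pattern are usually slow in regexp engines.
Here is a benchmark showing the difference on Go 1.10: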
package reggie
import (
"regexp"
"testing"
)
var pattern = regexp.MustCompile(`^.*foo.*$`)
var pattern2 = regexp.MustCompile(`foo`)
func BenchmarkRegexp(b *testing.B) {
for i := 0; i < b.N; i++ {
pattern.MatchString("youfathairyfoobar")
}
}
func BenchmarkRegexp2(b *testing.B) {
for i := 0; i < b.N; i++ {
pattern2.MatchString("youfathairyfoobar")
}
}
$ go test -bench=.
goos: darwin
goarch: amd64
BenchmarkRegexp-4 3000000 471 ns/op
BenchmarkRegexp2-4 20000000 101 ns/op
PASS
ok _/Users/jsandrew/wip/src/reg 4.031s
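As a quick sanity check of the claim above, here is a minimal sketch confirming that the two patterns agree on a few sample lines. The sample lines and the helper name agree are made up for illustration. It also compares against strings.Contains, which is a different technique (no regexp at all) and only applies when the pattern is a plain literal like foo, as it is in the question.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	anchored = regexp.MustCompile(`^.*foo.*$`) // pattern from the question
	plain    = regexp.MustCompile(`foo`)       // simplified pattern from the answer
)

// agree reports whether the anchored pattern, the plain pattern and
// strings.Contains all give the same answer for a single line.
// (Hypothetical helper, used only in this sketch.)
func agree(line string) bool {
	a := anchored.MatchString(line)
	b := plain.MatchString(line)
	c := strings.Contains(line, "foo")
	return a == b && b == c
}

func main() {
	// Lines as bufio.Scanner would return them: no trailing newline.
	for _, line := range []string{"a,foo,b", "a,b,c", "foo", ""} {
		fmt.Printf("%q: match=%v, all agree=%v\n", line, plain.MatchString(line), agree(line))
	}
}
```

The equivalence only holds for input without embedded newlines, because . does not match a newline by default; lines returned by scanner.Text() satisfy that.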

Related

Golang: How to test functions that use time.Now()? [duplicate]

I have a function that calculates the number of days since a particular timestamp, where the timestamp comes from an external API (it arrives as a string in the API's JSON response).
I have been following this article on how to test functions that use time.Now():
https://medium.com/go-for-punks/how-to-test-functions-that-use-time-now-ea4f2453d430
My function looks like this:
type funcTimeType func() time.Time // per suggested in article
func ageOfReportDays(dateString string, funcTime funcTimeType) {
// date string will look like this:
//"2022-08-30 09:05:27.567995"
parseLayout := "2006-01-02 15:04:05.000000"
t, err := time.Parse(parseLayout, dateString)
if err != nil {
fmt.Printf("Error parsing datetime value %v: %w", timeStr, err)
}
days := int(time.Since(t).Abs().Hours() / 24)
//fmt.Println(days)
return days, nil
}
As you can see, I am not using the funcTime funcTimeType in my actual function, as indicated in the article, because I cannot figure out how my function would be implemented with that.
The unit test I would hope to run would be something like this:
func Test_ageOfReportDays(t *testing.T) {
t.Run("timestamp age in days test", func(t *testing.T) {
parseLayout := "2006-01-02 15:04:05.000000"
dateString := "2022-08-30 09:05:27.567995" // example of recent timestamp
mockNow := func() time.Time {
fakeTime, _ := time.Parse(parseLayout, "2023-01-20 09:00:00.000000")
return fakeTime
}
// now I want to use "fakeTime" to spoof "time.Now()" so I can test my function
got: ageOfReportDays(dateString, mockNow)
expected: 152
if got != expected {
t.Errorf("expected '%d' but got '%d'", expected, got)
}
}
Obviously my code does not quite line up with the article author's code.
Is there a good way for me to write a unit test for this function, based on how the article suggests mocking time.Now()?
You are pretty close. Changing time.Since(t) to funcTime().Sub(t) would probably get you past the finish line.
From time package docs:
time.Since returns the time elapsed since t. It is shorthand for time.Now().Sub(t).
Example function:
import (
"fmt"
"time"
)
const parseLayout = "2006-01-02 15:04:05.000000"
type funcTimeType func() time.Time // per suggested in article
func ageOfReportDays(dateString string, funcTime funcTimeType) (int, error) {
t, err := time.Parse(parseLayout, dateString)
if err != nil {
return 0, fmt.Errorf("parsing datetime value %v: %w", dateString, err)
}
days := int(funcTime().Sub(t).Hours() / 24)
//fmt.Println(days)
return days, nil
}
And a test:
import (
"testing"
"time"
)
func Test_ageOfReportDays(t *testing.T) {
t.Run("timestamp age in days test", func(t *testing.T) {
dateString := "2022-08-30 09:05:27.567995" // example of recent timestamp
mockNow := func() time.Time {
fakeTime, _ := time.Parse(parseLayout, "2023-01-20 09:00:00.000000")
return fakeTime
}
// pass mockNow in place of time.Now so that the result is deterministic
got, _ := ageOfReportDays(dateString, mockNow)
expected := 142
if got != expected {
t.Errorf("expected '%d' but got '%d'", expected, got)
}
})
}

How to extract substrings [duplicate]

I wrote the following Go code to extract two values from inside a string.
I used two regexps to find the numbers (float64).
The first result is correct: it contains only the number. But the second is wrong.
This is the code:
package main
import (
"fmt"
"regexp"
)
func main() {
// RegExp uses the RE2 syntax
pat1 := regexp.MustCompile(`[^m2!3d][\d\.-]+`)
s1 := pat1.FindString(`Torre+Eiffel!8m2!3d-48.8583701!4d-2.2944813!3m4!1s0x47e66e2964e34e2d:0x8ddca9ee380ef7e0!8m2!3d-48.8583701!4d-2.2944813`)
pat2 := regexp.MustCompile(`[^!4d][\d\.-]+`)
s2 := pat2.FindString(`Torre+Eiffel!8m2!3d-48.8583701!4d-2.2944813!3m4!1s0x47e66e2964e34e2d:0x8ddca9ee380ef7e0!8m2!3d-48.8583701!4d-2.2944813`)
fmt.Println(s1) // Print -> -48.8583701
fmt.Println(s2) // Print -> m2 (The correct answer is "-2.2944813")
}
If I change the second pattern to
pat2 := regexp.MustCompile(`!4d[\d\.-]+`)
I get the following result:
!4d-2.2944813
but it's not what I'm expecting.
It seems you are only interested in the latitude and longitude of an attraction, and not really in the regex.
Maybe you could just use something like this:
package main
import (
"fmt"
"strconv"
"strings"
)
// Strip only the "3d"/"4d" field prefixes so that a leading minus sign is kept.
var replacer = strings.NewReplacer("3d", "", "4d", "")
func main() {
var str = `Torre+Eiffel!8m2!3d-48.8583701!4d-2.2944813!3m4!1s0x47e66e2964e34e2d:0x8ddca9ee380ef7e0!8m2!3d-48.8583701!4d-2.2944813`
fmt.Println(getLatLong(str))
}
func getLatLong(str string) (float64, float64, error) {
parts := strings.Split(str, "!")
// The coordinates are expected in the third and fourth "!"-separated fields.
if len(parts) < 4 {
return 0, 0, fmt.Errorf("unexpected format: %q", str)
}
lat, err := strconv.ParseFloat(replacer.Replace(parts[2]), 64)
if err != nil {
return 0, 0, err
}
lng, err := strconv.ParseFloat(replacer.Replace(parts[3]), 64)
if err != nil {
return 0, 0, err
}
return lat, lng, nil
}
https://play.golang.org/p/UOIwGbl6nrb
You were almost there. Try (?m)(?:3d|4d)-([\d\.-]+)(?:!|$)
https://regex101.com/r/8KgirB/1
All you need is a capturing group around the [\d\.-]+ part. With this group you can access the number directly.
package main
import (
"fmt"
"regexp"
)
func main() {
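// (?:!|$) also accepts a value at the very end of the string, as in the pattern suggested above.
var re = regexp.MustCompile(`(?m)(?:3d|4d)-([\d\.-]+)(?:!|$)`)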
var str = `Torre+Eiffel!8m2!3d-48.8583701!4d-2.2944813!3m4!1s0x47e66e2964e34e2d:0x8ddca9ee380ef7e0!8m2!3d-48.8583701!4d-2.2944813`
for _, match := range re.FindAllStringSubmatch(str, -1) {
fmt.Println(match[1])
}
}
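Since the question expects the values with their minus sign ("-2.2944813"), here is a hedged variant that puts the optional sign inside the capturing group and converts the result to float64. The pattern and the helper name firstLatLong are assumptions made for this sketch, not part of either answer; it assumes the fields always look like !3d<number> and !4d<number>.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

// coordRe captures the field type ("3" or "4") and the signed number
// that follows "!3d" (latitude) or "!4d" (longitude).
var coordRe = regexp.MustCompile(`!([34])d(-?\d+(?:\.\d+)?)`)

// firstLatLong returns the first latitude/longitude pair found in s.
func firstLatLong(s string) (lat, lng float64, err error) {
	var haveLat, haveLng bool
	for _, m := range coordRe.FindAllStringSubmatch(s, -1) {
		v, perr := strconv.ParseFloat(m[2], 64)
		if perr != nil {
			return 0, 0, perr
		}
		switch {
		case m[1] == "3" && !haveLat:
			lat, haveLat = v, true
		case m[1] == "4" && !haveLng:
			lng, haveLng = v, true
		}
	}
	if !haveLat || !haveLng {
		return 0, 0, errors.New("no !3d/!4d coordinate pair found")
	}
	return lat, lng, nil
}

func main() {
	s := `Torre+Eiffel!8m2!3d-48.8583701!4d-2.2944813!3m4!1s0x47e66e2964e34e2d:0x8ddca9ee380ef7e0!8m2!3d-48.8583701!4d-2.2944813`
	fmt.Println(firstLatLong(s))
}
```

Requiring the leading ! keeps the pattern from matching inside the hexadecimal part of the string, which also contains digit-and-d sequences.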

How can I unit test that a text will appear in the center of the screen?

This is a little script in go.
package bashutil
import (
"fmt"
"github.com/nsf/termbox-go"
)
func Center(s string) {
if err := termbox.Init(); err != nil {
panic(err)
}
w, _ := termbox.Size()
termbox.Close()
fmt.Printf(
fmt.Sprintf("%%-%ds", w/2),
fmt.Sprintf(fmt.Sprintf("%%%ds", w/2+len(s)/2), s),
)
}
Can I unit test it, and if so, how? I think it is pointless to test such a small snippet, but what if I did want to test this code? How can I check that the output equals what I expect?
Can I test that fmt prints what I expect?
What does "test" mean here?
I think a test needs to check the output of a function.
Your function's output goes to stdout, so we first need to capture that output.
We can do this simply:
func TestCenter(*testing.T) {
stdoutBak := os.Stdout
r, w, _ := os.Pipe()
os.Stdout = w
Center("hello")
w.Close()
os.Stdout = stdoutBak
// Check output as a byte array
outstr, _ := ioutil.ReadAll(r)
fmt.Printf("%s", outstr)
}
Thus, you can check output format, spelling, etc.

Subset of table driven test

When testing functions, I can select which tests will run with the -run option.
go test -run regex
When there are dozens of test cases, it is very common to put them into a slice so that we don't have to write a function for each one:
cases := []struct {
arg, expected string
} {
{"%a", "[%a]"},
{"%-a", "[%-a]"},
// and many others
}
for _, c := range cases {
res := myfn(c.arg)
if res != c.expected {
t.Errorf("myfn(%q) should return %q, but it returns %q", c.arg, c.expected, res)
}
}
This works well, but the problem is maintenance. When I add a new test case, I want to run just that case while debugging, but I cannot say something like:
go test -run TestMyFn.onlyThirdCase
Is there an elegant way to keep many test cases in a slice and still be able to choose which one runs?
With Go 1.6 (and below)
This is not supported directly by the testing package in Go 1.6 and below. You have to implement it yourself.
But it's not that hard. You can use flag package to easily access command line arguments.
Let's see an example. We define an "idx" command line parameter: if it is present, only the case at that index is executed; otherwise all test cases run.
Define flag:
var idx = flag.Int("idx", -1, "specify case index to run only")
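Parse command line flags (this is not required, because go test already parses the flags before running the tests; note that since Go 1.13, calling flag.Parse during package initialization can make tests fail, because the testing flags are not registered yet, so leave this out on newer versions):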
func init() {
flag.Parse()
}
Using this parameter:
for i, c := range cases {
if *idx != -1 && *idx != i {
println("Skipping idx", i)
continue
}
if res := myfn(c.arg); res != c.expected {
t.Errorf("myfn(%q) should return %q, but it returns %q", c.arg, c.expected, res)
}
}
Testing it with 3 test cases:
cases := []struct {
arg, expected string
}{
{"%a", "[%a]"},
{"%-a", "[%-a]"},
{"%+a", "[%+a]"},
}
Without idx parameter:
go test
Output:
PASS
ok play 0.172s
Specifying an index:
go test -idx=1
Output:
Skipping idx 0
Skipping idx 2
PASS
ok play 0.203s
Of course you can implement more sophisticated filtering logic, e.g. you can have minidx and maxidx flags to run cases in a range:
var (
minidx = flag.Int("minidx", 0, "min case idx to run")
maxidx = flag.Int("maxidx", -1, "max case idx to run")
)
And the filtering:
if i < *minidx || *maxidx != -1 && i > *maxidx {
println("Skipping idx", i)
continue
}
Using it:
go test -maxidx=1
Output:
Skipping idx 2
PASS
ok play 0.188s
Starting with Go 1.7
Go 1.7 (released in August 2016) adds the definition of subtests and sub-benchmarks:
The testing package now supports the definition of tests with subtests and benchmarks with sub-benchmarks. This support makes it easy to write table-driven benchmarks and to create hierarchical tests. It also provides a way to share common setup and tear-down code. See the package documentation for details.
With that, you can do things like:
func TestFoo(t *testing.T) {
// <setup code>
t.Run("A=1", func(t *testing.T) { ... })
t.Run("A=2", func(t *testing.T) { ... })
t.Run("B=1", func(t *testing.T) { ... })
// <tear-down code>
}
Where the subtests are named "A=1", "A=2", "B=1".
The argument to the -run and -bench command-line flags is a slash-separated list of regular expressions that match each name element in turn. For example:
go test -run Foo # Run top-level tests matching "Foo".
go test -run Foo/A= # Run subtests of Foo matching "A=".
go test -run /A=1 # Run all subtests of a top-level test matching "A=1".
How does this help your case? The names of subtests are string values, which can be generated on-the-fly, e.g.:
for i, c := range cases {
name := fmt.Sprintf("C=%d", i)
t.Run(name, func(t *testing.T) {
if res := myfn(c.arg); res != c.expected {
t.Errorf("myfn(%q) should return %q, but it returns %q",
c.arg, c.expected, res)
}
})
}
To run the case at index 2, you could start it like
go test -run /C=2
or
go test -run TestName/C=2
I wrote some simple code that works with both versions, although with slightly different command line options. The version for Go 1.7 is:
// +build go1.7
package plist
import "testing"
func runTest(name string, fn func(t *testing.T), t *testing.T) {
t.Run(name, fn)
}
and 1.6 and older:
// +build !go1.7
package plist
import (
"flag"
"testing"
"runtime"
"strings"
"fmt"
)
func init() {
flag.Parse()
}
var pattern = flag.String("pattern", "", "specify which test(s) should be executed")
var verbose = flag.Bool("verbose", false, "write whether test was done")
// This is a hack that roughly simulates t.Run, which is available from Go 1.7.
func runTest(name string, fn func(t *testing.T), t *testing.T) {
// Obtain the name of the calling function.
var pc [10]uintptr
runtime.Callers(2, pc[:])
var fnName = ""
f := runtime.FuncForPC(pc[0])
if f != nil {
fnName = f.Name()
}
names := strings.Split(fnName, ".")
fnName = names[len(names)-1] + "/" + name
if strings.Contains(fnName, *pattern) {
if *verbose {
fmt.Printf("%s is executed\n", fnName)
}
fn(t)
} else {
if *verbose {
fmt.Printf("%s is skipped\n", fnName)
}
}
}

How can I write a Go test that writes to stdin?

Say that I have a simple application that reads lines from stdin and simply echoes it back to stdout. For example:
package main
import (
"bufio"
"fmt"
"io"
"os"
)
func main() {
reader := bufio.NewReader(os.Stdin)
for {
fmt.Print("> ")
bytes, _, err := reader.ReadLine()
if err == io.EOF {
os.Exit(0)
}
fmt.Println(string(bytes))
}
}
I would like to write a test case that writes to stdin and then compares the output to the input. For example:
package main
import (
"bufio"
"io"
"os"
"os/exec"
"testing"
)
func TestInput(t *testing.T) {
subproc := exec.Command(os.Args[0])
stdin, _ := subproc.StdinPipe()
stdout, _ := subproc.StdoutPipe()
defer stdin.Close()
input := "abc\n"
subproc.Start()
io.WriteString(stdin, input)
reader := bufio.NewReader(stdout)
bytes, _, _ := reader.ReadLine()
output := string(bytes)
if input != output {
t.Errorf("Wanted: %v, Got: %v", input, output)
}
subproc.Wait()
}
Running go test -v gives me the following:
=== RUN TestInput
--- FAIL: TestInput (3.32s)
echo_test.go:25: Wanted: abc
, Got: --- FAIL: TestInput (3.32s)
FAIL
exit status 1
I'm obviously doing something incorrect here. How should I go about testing this type of code?
Instead of doing everything in main with stdin and stdout, you can define a function that takes an io.Reader and an io.Writer as parameters and does whatever you want it to do. main could then call that function and your test function could test that function directly.
Here is an example that writes to the command's stdin and reads from its stdout. Note that the comparison fails as written because the output starts with the "> " prompt. Still, you can modify it to suit your needs.
func TestInput(t *testing.T) {
subproc := exec.Command("yourCmd")
input := "abc\n"
subproc.Stdin = strings.NewReader(input)
// Output starts the command, waits for it to exit, and returns its stdout,
// so no separate Wait call is needed.
output, err := subproc.Output()
if err != nil {
t.Fatal(err)
}
if input != string(output) {
t.Errorf("Wanted: %q, Got: %q", input, string(output))
}
}