I have a program that checks whether keywords appear on a web page. But after checking 1000-3000 URLs it hangs: there is no output, it does not exit, and the number of TCP connections is zero. I don't know why no new connections are being made.
Could you give me some advice on how to debug it?
type requestReturn struct {
url string
status bool
}
var timeout = time.Duration(800 * time.Millisecond)
func checkUrls(urls []string, kws string, threadLimit int) []string {
limitChan := make(chan int, threadLimit)
ok := make(chan requestReturn, 1)
var result []string
i := 0
for ; i < threadLimit; i++ {
go func(u string) {
request(u, limitChan, ok, kws)
}(urls[i])
}
for o := range ok {
if o.status {
result = append(result, o.url)
log.Printf("success %s,remain %d", o.url, len(urls)-i)
} else {
log.Printf("fail %s,remain %d", o.url, len(urls)-i)
}
if i < len(urls) {
go func(u string) {
request(u, limitChan, ok, kws)
}(urls[i])
i++
}
}
close(limitChan)
return result
}
func dialTimeout(network, addr string) (net.Conn, error) {
return net.DialTimeout(network, addr, timeout)
}
func request(url string, threadLimit chan int, ok chan requestReturn, kws string) {
threadLimit <- 1
log.Printf("%s, start...", url)
//startTime := time.Now().UnixNano()
rr := requestReturn{url: url}
transport := http.Transport{
Dial: dialTimeout,
DisableKeepAlives: true,
}
client := http.Client{
Transport: &transport,
Timeout: time.Duration(15 * time.Second),
}
resp, e := client.Get(url)
if e != nil {
log.Printf("%q", e)
rr.status = false
return
}
if resp.StatusCode == 200 {
body, err := ioutil.ReadAll(resp.Body)
if err != nil {
log.Printf("%q", err)
rr.status = false
return
}
content := bytes.NewBuffer(body).String()
matched, err1 := regexp.MatchString(kws, content)
if err1 != nil {
log.Printf("%q", err1)
rr.status = false
} else if matched {
rr.status = true
log.Println(rr.url)
} else {
rr.status = false
}
} else {
rr.status = false
}
defer (func() {
resp.Body.Close()
ok <- rr
//processed := float32(time.Now().UnixNano()-startTime) / 1e9
//log.Printf("%s, status:%t,time:%.3fs", rr.url, rr.status, processed)
<-threadLimit
})()
}
You seem to be using two forms of concurrency control in this code, and both have problems.
You've got limitChan, which looks like it is being used as a semaphore (request sends a value at its start, and receives a value in a defer in that function). But checkUrls is also trying to make sure it only has threadLimit goroutines running at once (by spawning that number first up, and only spawning more when one reports its results on the ok channel). Only one of these should be necessary to limit the concurrency.
Both methods fail due to the way the defer is set up in request: several return statements occur before the defer, so the function can complete without sending its result to the ok channel and without freeing up its slot in limitChan. After enough errors, checkUrls stops spawning new goroutines and you see your hang.
The fix is to place the defer statement before any of the return statements so you know it will always be run. Something like this:
func request(url string, threadLimit chan int, ok chan requestReturn, kws string) {
    threadLimit <- 1
    rr := requestReturn{url: url}
    var resp *http.Response
    defer func() {
        if resp != nil {
            resp.Body.Close()
        }
        ok <- rr
        <-threadLimit
    }()
    ...
}
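For completeness, here is how the whole function could look with that change. This is just a sketch: it keeps your helpers (dialTimeout) and imports, and relies on rr.status being false by default on the error paths.

func request(url string, threadLimit chan int, ok chan requestReturn, kws string) {
    threadLimit <- 1
    log.Printf("%s, start...", url)
    rr := requestReturn{url: url}
    var resp *http.Response
    defer func() {
        if resp != nil {
            resp.Body.Close()
        }
        ok <- rr      // every return path now reports a result
        <-threadLimit // ...and frees its semaphore slot
    }()
    transport := http.Transport{
        Dial:              dialTimeout,
        DisableKeepAlives: true,
    }
    client := http.Client{ // note: the client could be hoisted out of request and reused
        Transport: &transport,
        Timeout:   15 * time.Second,
    }
    var err error
    resp, err = client.Get(url)
    if err != nil {
        log.Printf("%q", err)
        return
    }
    if resp.StatusCode != http.StatusOK {
        return
    }
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Printf("%q", err)
        return
    }
    matched, err := regexp.MatchString(kws, string(body))
    if err != nil {
        log.Printf("%q", err)
        return
    }
    rr.status = matched
}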
When I print the activities here, they print in the order they were created, but at assertion time they are picked up in random order, and the expected values are picked in random order as well. The API I'm calling, MasterController, runs some goroutines and could take time; maybe that is the reason, but I'm not sure.
for i, param := range params {
gin.SetMode(gin.TestMode)
w := httptest.NewRecorder()
ctx, _ := gin.CreateTestContext(w)
ctx.Request = &http.Request{
URL: &url.URL{},
Header: make(http.Header),
}
MockJsonPost(ctx, param)
MasterController(ctx)
time.Sleep(3 * time.Second)
fmt.Println("response body", string(w.Body.Bytes()))
fmt.Println("status", w.Code)
// var activity *activity.Activity
activity, err := activityController.GetLastActivity(nil)
//tx.Raw("select * from activity order by id desc limit 1").Find(&activity)
if err != nil {
fmt.Println("No activity found")
}
activityJson, err := activity.ToJsonTest()
if err != nil {
fmt.Println("error converting in json")
}
fmt.Printf("response activity %+v", string(activityJson))
assert.EqualValues(t, string(expected[i]), string(activityJson))
}
func MockJsonPost(c *gin.Context, content interface{}) {
c.Request.Method = "POST" // or PUT
c.Request.Header.Set("Content-Type", "application/json")
jsonbytes, err := json.Marshal(content)
if err != nil {
panic(err)
}
// the request body must be an io.ReadCloser
// the bytes buffer though doesn't implement io.Closer,
// so you wrap it in a no-op closer
c.Request.Body = io.NopCloser(bytes.NewBuffer(jsonbytes))
}
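As the comments in MockJsonPost note, http.Request.Body must be an io.ReadCloser, while bytes.Buffer only implements io.Reader. A tiny standalone check of that wrapping, with a hypothetical payload:

package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    payload := []byte(`{"user":"example"}`) // hypothetical JSON body
    req := &http.Request{
        Method: "POST",
        URL:    &url.URL{},
        Header: make(http.Header),
        // bytes.Buffer has no Close method, so wrap it in a no-op closer
        Body: io.NopCloser(bytes.NewBuffer(payload)),
    }
    body, _ := io.ReadAll(req.Body)
    fmt.Println(string(body))
}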
I am using Gomock https://godoc.org/github.com/golang/mock and mockgen
The source code for this test is:
package sqs
import (
"fmt"
"log"
"os"
"runtime"
"github.com/aws/aws-sdk-go/aws/session"
"github.com/aws/aws-sdk-go/aws"
"github.com/aws/aws-sdk-go/service/sqs"
"github.com/aws/aws-sdk-go/service/sqs/sqsiface"
)
var sess *session.Session
var svc *sqs.SQS
var queueURL string
func init() {
// Setting the runtime to run with max CPUs available
runtime.GOMAXPROCS(runtime.NumCPU())
sess = session.Must(session.NewSessionWithOptions(session.Options{
SharedConfigState: session.SharedConfigEnable,
}))
svc = sqs.New(sess)
queueURL = os.Getenv("QUEUE_URL")
}
type Poller interface {
Poll(chan bool)
}
// NewPoller is a factory to create a Poller object
func NewPoller(msgr Messenger) Poller {
p := &poller{
m: msgr,
}
return p
}
type poller struct {
m Messenger
}
func (p *poller) Poll(done chan bool) {
sqsMsgCh := make(chan *sqs.Message, 100)
for {
messages, err := p.m.GetMessage()
if err != nil {
log.Printf("error when getting message")
if len(messages) == 0 {
// Stop the system
log.Printf("I am here")
done <- true
}
}
for _, msg := range messages {
sqsMsgCh <- msg
}
}
}
type Messenger interface {
GetMessage() ([]*sqs.Message, error)
}
func NewMessenger() Messenger {
return &messenger{
s: svc,
}
}
type messenger struct {
s sqsiface.SQSAPI
}
func (m *messenger) GetMessage() ([]*sqs.Message, error) {
result, err := m.s.ReceiveMessage(&sqs.ReceiveMessageInput{
AttributeNames: []*string{
aws.String(sqs.MessageSystemAttributeNameSentTimestamp),
},
MessageAttributeNames: []*string{
aws.String(sqs.QueueAttributeNameAll),
},
QueueUrl: aws.String(queueURL),
MaxNumberOfMessages: aws.Int64(10),
VisibilityTimeout: aws.Int64(36000), // 10 hours
WaitTimeSeconds: aws.Int64(0),
})
if err != nil {
fmt.Println("Error", err)
return nil, err
}
msgs := result.Messages
if len(msgs) == 0 {
fmt.Println("Received no messages")
return msgs, err
}
return msgs, nil
}
The test case for this source file is here:
package sqs
import (
"errors"
"testing"
"path_to_the_mocks_package/mocks"
"github.com/golang/mock/gomock"
"github.com/aws/aws-sdk-go/service/sqs"
)
func TestPollWhenNoMessageOnQueue(t *testing.T) {
mockCtrl := gomock.NewController(t)
defer mockCtrl.Finish()
msgr := mocks.NewMockMessenger(mockCtrl)
mq := make([]*sqs.Message, 1)
err := errors.New("Mock Error")
// msgr.EXPECT().GetMessage().Return(mq, err) //.Times(1)
// msgr.GetMessage().Return(mq, err) //.Times(1)
msgr.EXPECT().GetMessage().Return(mq, err)
p := NewPoller(msgr)
done := make(chan bool)
go p.Poll(done)
<-done
t.Logf("Successfully done: %v", done)
}
When I run the tests I am getting the following error:
sqs\controller.go:150: Unexpected call to
*mocks.MockMessenger.GetMessage([]) at path_to_mocks_package/mocks/mock_messenger.go:38 because: Expected
call at path_to_sqs_package/sqs/sqs_test.go:35 has already been called
the max number of times. FAIL
If I write my own mock as follows the test case executes successfully:
type mockMessenger struct {
mock.Mock
}
func (m *mockMessenger) GetMessage() ([]*sqs.Message, error) {
msgs := make([]*sqs.Message, 0)
err := errors.New("Error")
return msgs, err
}
You are implicitly telling gomock that you only expect a single call.
msgr.EXPECT().GetMessage().Return(mq, err)
Adding a Times call (or AnyTimes) to the expectation allows those values to be returned more than once.
msgr.EXPECT().GetMessage().Return(mq, err).AnyTimes()
For more details, please read gomock's AnyTimes documentation.
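If the test should instead pin down an exact or bounded call count, gomock's Call API also has Times, MinTimes, and MaxTimes. A quick sketch against the same mock:

// Return the canned values for exactly two calls; a third call fails the test.
msgr.EXPECT().GetMessage().Return(mq, err).Times(2)

// Or bound the count instead of fixing it exactly.
msgr.EXPECT().GetMessage().Return(mq, err).MinTimes(1).MaxTimes(5)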
I am trying to implement a Go TCP server, and the concurrency it achieves satisfies me (150,000+ requests/s), but the CPU usage is too high: about 800% on a 24-core Linux machine, whereas a comparable C++ TCP server (built on libevent) uses only about 200% at similar concurrency.
The following is the Go demo code:
func main() {
listen, err := net.Listen("tcp", "0.0.0.0:17379")
if err != nil {
fmt.Errorf(err.Error())
}
go acceptClient(listen)
var channel2 = make(chan bool)
<-channel2
}
func acceptClient(listen net.Listener) {
for {
sock, err := listen.Accept()
if err != nil {
fmt.Errorf(err.Error())
}
tcp := sock.(*net.TCPConn)
tcp.SetNoDelay(true)
var channel = make(chan bool, 10)
go read(channel, sock.(*net.TCPConn))
go write(channel, sock.(*net.TCPConn))
}
}
func read(channel chan bool, sock *net.TCPConn) {
count := 0
for {
var buf = make([]byte, 1024)
n, err := sock.Read(buf)
if err != nil {
close(channel)
sock.CloseRead()
return
}
count += n
x := count / 58
count = count % 58
for i := 0; i < x; i++ {
channel <- true
}
}
}
func write(channel chan bool, sock *net.TCPConn) {
buf := []byte("+OK\r\n")
defer func() {
sock.CloseWrite()
recover()
}()
for {
_, ok := <-channel
if !ok {
return
}
_, writeError := sock.Write(buf)
if writeError != nil {
return
}
}
}
I tested this TCP server with redis-benchmark, using multiple clients:
redis-benchmark -h 10.100.45.2 -p 17379 -n 1000 -q script load "redis.call('set','aaa','aaa')"
I also profiled my Go code with pprof; it shows that a lot of the CPU time is spent in syscalls.
I don't think parallelising the read and the write with a channel will give you better performance in this case. You should try to do fewer memory allocations and fewer syscalls (the write function may be issuing a lot of them).
Can you try this version?
package main

import (
    "bytes"
    "fmt"
    "net"
)

func main() {
    listen, err := net.Listen("tcp", "0.0.0.0:17379")
    if err != nil {
        panic(err) // fmt.Errorf only builds an error value; it never reported anything
    }
    acceptClient(listen)
}

func acceptClient(listen net.Listener) {
    for {
        sock, err := listen.Accept()
        if err != nil {
            fmt.Println(err)
            continue
        }
        tcp := sock.(*net.TCPConn)
        tcp.SetNoDelay(true)
        go handleConn(tcp) // one goroutine per conn, and no concurrent read/write split on the same conn
    }
}

var respPattern = []byte("+OK\r\n")

// just one goroutine per conn
func handleConn(sock *net.TCPConn) {
    count := 0
    buf := make([]byte, 4098) // do not create a new buffer on each read, and increase the buffer size
    defer sock.Close()
    for {
        n, err := sock.Read(buf)
        if err != nil {
            return
        }
        count += n
        x := count / 58
        count = count % 58
        resp := bytes.Repeat(respPattern, x) // could be optimized further
        _, writeError := sock.Write(resp)    // one write syscall per batch
        if writeError != nil {
            return
        }
    }
}
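If syscalls still dominate after that change, a further variation (a sketch, not benchmarked here, and it additionally imports bufio) is to buffer the responses and flush once per read:

func handleConn(sock *net.TCPConn) {
    count := 0
    buf := make([]byte, 4098)
    w := bufio.NewWriter(sock) // collects the small "+OK\r\n" writes in memory
    defer sock.Close()
    for {
        n, err := sock.Read(buf)
        if err != nil {
            return
        }
        count += n
        for i := 0; i < count/58; i++ {
            w.Write(respPattern) // buffered: no syscall per response
        }
        count %= 58
        if err := w.Flush(); err != nil { // a single write syscall per batch
            return
        }
    }
}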
So I'm trying to write a web crawler using Rob Pike's fan-in function.
This is my code:
package main
import (
"net/http"
"encoding/json"
"fmt"
"io/ioutil"
)
func main() {
fanIn(getDuckDuckGo("food"), getGitHub("defunkt"))
}
type DuckDuckGoResponse struct {
RelatedTopics []struct {
Result string `json:"Result"`
FirstUrl string `json:"FirstURL"`
Text string `json:"Text"`
} `json:"RelatedTopics"`
}
type GitHubResponse struct {
Login string `json:"login"`
Email string `json:"email"`
Name string `json:"name"`
}
func fanIn(input1 <-chan DuckDuckGoResponse, input2 <-chan GitHubResponse) <-chan string {
c := make(chan string)
go func() {
for {
select {
case s := <-input1:
fmt.Println(s)
case s := <-input2:
fmt.Println(s)
}
}
}()
return c
}
func getDuckDuckGo(k string) <-chan DuckDuckGoResponse {
resp, err := http.Get("http://api.duckduckgo.com/?q=" + k + "&format=json&pretty=1")
if err != nil {
return nil
}
c := make(chan DuckDuckGoResponse)
var duckDuckParsed DuckDuckGoResponse
jsonDataFromHttp, jsonErr := ioutil.ReadAll(resp.Body)
if jsonErr != nil {
fmt.Println("Json error!")
}
defer resp.Body.Close()
if err:= json.Unmarshal(jsonDataFromHttp, &duckDuckParsed); err != nil {
panic(err)
}
return c
}
func getGitHub(k string) <-chan GitHubResponse {
resp, err := http.Get("https://api.github.com/users/?q=" + k)
if err != nil {
return nil
}
c := make(chan GitHubResponse)
var githubParsed GitHubResponse
jsonDataFromHttp, jsonErr := ioutil.ReadAll(resp.Body)
if jsonErr != nil {
fmt.Println("Json error!")
}
defer resp.Body.Close()
if err:= json.Unmarshal(jsonDataFromHttp, &githubParsed); err != nil {
panic(err)
}
return c
}
I run this program, and nothing prints.
Why?
Thanks
At first glance, the fanIn function returns a channel that is not being read from in your main loop. So yes, you are invoking the fanIn function which returns a channel, but there is nothing reading off of that channel. For a channel to be useful there needs to be a consumer consuming from the channel while on the other end there needs to be a producer producing on that channel. In other words, sending on a channel can't make progress unless someone on the other end is ready to receive on it.
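To see that last point in isolation, here is a minimal, self-contained sketch: the send blocks inside the goroutine until main is ready to receive.

package main

import "fmt"

func main() {
    c := make(chan string)
    go func() { c <- "hello" }() // the send blocks until someone receives
    fmt.Println(<-c)             // without this receive, the value is never delivered
}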
Next, your getGitHub and getDuckDuckGo functions return channels, but they never actually send anything on them. Also, what you really need is a way to invoke those functions, have them return their channel immediately, and still get the work done: you need additional goroutines so the http.Get calls can do their work.
Lastly, your fanIn function also creates and returns a channel, but it doesn't actually "fan in" the results from input1 and input2 onto it. And since fanIn returns a channel of type string, you'll need to send a string into it, which could be a field off of DuckDuckGoResponse or GitHubResponse.
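Putting those fixes together, a rough sketch of the pattern; it reuses your DuckDuckGoResponse and GitHubResponse types and assumes getGitHub is reworked the same way as getDuckDuckGo (and fmt.Sprint stringifies the whole struct just for brevity):

func getDuckDuckGo(k string) <-chan DuckDuckGoResponse {
    c := make(chan DuckDuckGoResponse)
    go func() { // do the work in a goroutine so the function can return the channel immediately
        resp, err := http.Get("http://api.duckduckgo.com/?q=" + k + "&format=json&pretty=1")
        if err != nil {
            return
        }
        defer resp.Body.Close()
        var parsed DuckDuckGoResponse
        if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
            return
        }
        c <- parsed // actually send the result on the channel
    }()
    return c
}

func fanIn(input1 <-chan DuckDuckGoResponse, input2 <-chan GitHubResponse) <-chan string {
    c := make(chan string)
    go func() {
        for {
            select {
            case s := <-input1:
                c <- fmt.Sprint(s) // forward onto the merged channel
            case s := <-input2:
                c <- fmt.Sprint(s)
            }
        }
    }()
    return c
}

func main() {
    merged := fanIn(getDuckDuckGo("food"), getGitHub("defunkt"))
    for i := 0; i < 2; i++ {
        fmt.Println(<-merged) // consume from the returned channel
    }
}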
I urge you to look at this streamlined example of what you are trying to accomplish: https://talks.golang.org/2012/go-docs/faninboring.go
One last observation: you check jsonErr != nil and print a message, but you probably want to return nil there as well, to prevent the code from continuing.
I hope this gives you just enough insight to get your code working. Good luck!
I've just started learning Go and have been working through the tour. The last exercise is to edit a web crawler to crawl in parallel and without repeats.
Here is the link to the exercise: http://tour.golang.org/#70
Here is the code. I only changed the crawl and the main function. So I'll just post those to keep it neat.
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
var used = make(map[string]bool)
var urlchan = make(chan string)
func Crawl(url string, depth int, fetcher Fetcher) {
// TODO: Fetch URLs in parallel.
// Done: Don't fetch the same URL twice.
// This implementation doesn't do either:
done := make(chan bool)
if depth <= 0 {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("\nfound: %s %q\n\n", url, body)
go func() {
for _, i := range urls {
urlchan <- i
}
done <- true
}()
for u := range urlchan {
if used[u] == false {
used[u] = true
go Crawl(u, depth-1, fetcher)
}
if <-done == true {
break
}
}
return
}
func main() {
used["http://golang.org/"] = true
Crawl("http://golang.org/", 4, fetcher)
}
The problem is that when I run the program the crawler stops after printing
not found: http://golang.org/cmd/
This only happens when I try to make the program run in parallel. If I have it run linearly then all the urls are found correctly.
Note: If I am not doing this right (parallelism I mean) then I apologise.
Be careful with goroutines: when the main goroutine (the main() func) returns, all other goroutines are killed immediately.
Your Crawl() looks recursive, but it isn't in the blocking sense: it can return immediately, without waiting for the other Crawl() goroutines it spawned. And once the first Crawl(), called by main(), returns, main() regards its mission fulfilled and exits.
What you can do is make main() wait until the last Crawl() returns. The sync package, or a chan, would help.
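For instance, a minimal sketch using sync.WaitGroup; this is my rearrangement of your code, assuming the tour's Fetcher interface and fetcher variable, with a mutex guarding the shared map:

var (
    used = make(map[string]bool)
    mu   sync.Mutex
    wg   sync.WaitGroup
)

func Crawl(url string, depth int, fetcher Fetcher) {
    defer wg.Done()
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        mu.Lock()
        seen := used[u]
        used[u] = true
        mu.Unlock()
        if !seen {
            wg.Add(1) // register the child before spawning it
            go Crawl(u, depth-1, fetcher)
        }
    }
}

func main() {
    used["http://golang.org/"] = true
    wg.Add(1)
    go Crawl("http://golang.org/", 4, fetcher)
    wg.Wait() // block until every Crawl goroutine has finished
}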
Alternatively, you could take a look at the solution I wrote for this exercise months ago:
var store map[string]bool

func Krawl(url string, fetcher Fetcher, Urls chan []string) {
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
    } else {
        fmt.Printf("found: %s %q\n", url, body)
    }
    Urls <- urls
}

func Crawl(url string, depth int, fetcher Fetcher) {
    Urls := make(chan []string)
    go Krawl(url, fetcher, Urls)
    band := 1
    store[url] = true // init for level 0 done
    for i := 0; i < depth; i++ {
        for band > 0 {
            band--
            next := <-Urls
            for _, url := range next {
                if _, done := store[url]; !done {
                    store[url] = true
                    band++
                    go Krawl(url, fetcher, Urls)
                }
            }
        }
    }
    return
}

func main() {
    store = make(map[string]bool)
    Crawl("http://golang.org/", 4, fetcher)
}
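The key to this version is band: it counts how many Krawl goroutines still owe a result on the Urls channel, going down each time a result is read off the channel and up for each new goroutine spawned, so Crawl keeps draining results until none are pending.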