Exercise: Web Crawler - concurrency not working

I am going through the Go tour and working on the final exercise, which is to change a web crawler so that it crawls pages in parallel without fetching the same URL twice ( http://tour.golang.org/#73 ). All I have changed is the Crawl function.
var used = make(map[string]bool)

func Crawl(url string, depth int, fetcher Fetcher) {
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    for _, u := range urls {
        if used[u] == false {
            used[u] = true
            Crawl(u, depth-1, fetcher)
        }
    }
    return
}
In order to make it concurrent I added the go keyword in front of the recursive call to Crawl, but instead of recursively crawling, the program only finds the "http://golang.org/" page and no other pages.
Why doesn't the program work when I add the go keyword to the call to Crawl?

The problem seems to be that your process is exiting before all URLs can be followed
by the crawler. Because of the concurrency, the main() function exits before
the worker goroutines are finished.
To circumvent this, you could use sync.WaitGroup:
func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
    defer wg.Done()
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    for _, u := range urls {
        if used[u] == false {
            used[u] = true
            wg.Add(1)
            go Crawl(u, depth-1, fetcher, wg)
        }
    }
    return
}
And call Crawl in main as follows:
func main() {
    wg := &sync.WaitGroup{}
    wg.Add(1) // the initial Crawl call below also calls wg.Done via its defer
    Crawl("http://golang.org/", 4, fetcher, wg)
    wg.Wait()
}
Also, don't rely on the map being thread safe: reading and writing used from several goroutines at once is a data race.
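A minimal sketch of one way to guard the shared used map in the WaitGroup version above, assuming a small helper (markUsed, a name introduced here) that serializes all map access with a sync.Mutex; it slots into the answer's program, which already imports sync:

var (
    used   = make(map[string]bool)
    usedMu sync.Mutex
)

// markUsed reports whether url still needs crawling and, if so, records it.
// Only this function touches the map, so concurrent Crawl goroutines are safe.
func markUsed(url string) bool {
    usedMu.Lock()
    defer usedMu.Unlock()
    if used[url] {
        return false
    }
    used[url] = true
    return true
}

The loop in Crawl would then do `if markUsed(u) { wg.Add(1); go Crawl(u, depth-1, fetcher, wg) }` instead of reading and writing the map directly.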

Here's an approach, again using sync.WaitGroup, but wrapping the fetch call in an anonymous goroutine. To make the URL map thread safe (meaning parallel goroutines can't read and change values at the same time), wrap the map in a new type that includes a sync.Mutex (the fetchedUrls type in my example) and use the Lock and Unlock methods while the map is being checked or updated. The struct is passed around as a pointer so that every goroutine shares the same mutex.
type fetchedUrls struct {
    urls map[string]bool
    mux  sync.Mutex
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
// used is a pointer so that all goroutines share the same map and mutex.
func Crawl(url string, depth int, fetcher Fetcher, used *fetchedUrls, wg *sync.WaitGroup) {
    if depth <= 0 {
        return
    }
    used.mux.Lock()
    if used.urls[url] == false {
        used.urls[url] = true
        wg.Add(1)
        go func() {
            defer wg.Done()
            body, urls, err := fetcher.Fetch(url)
            if err != nil {
                fmt.Println(err)
                return
            }
            fmt.Printf("found: %s %q\n", url, body)
            for _, u := range urls {
                Crawl(u, depth-1, fetcher, used, wg)
            }
            return
        }()
    }
    used.mux.Unlock()
    return
}

func main() {
    wg := &sync.WaitGroup{}
    used := &fetchedUrls{urls: make(map[string]bool)}
    Crawl("https://golang.org/", 4, fetcher, used, wg)
    wg.Wait()
}
Output:
found: https://golang.org/ "The Go Programming Language"
not found: https://golang.org/cmd/
found: https://golang.org/pkg/ "Packages"
found: https://golang.org/pkg/os/ "Package os"
found: https://golang.org/pkg/fmt/ "Package fmt"
Program exited.

I created two implementations (with different concurrency designs) of the same exercise here.
They also use a thread-safe map.
playground link

Related

How to solve this syncing issue in Unit testing golang?

Here, when I print the activities they come out in the order they were created, but at assertion time the activities are picked up in a random order, and the expected values are also compared in a random order. The API I'm calling, MasterController, starts some goroutines and could take time; maybe that is the reason, but I'm not sure.
for i, param := range params {
    gin.SetMode(gin.TestMode)
    w := httptest.NewRecorder()
    ctx, _ := gin.CreateTestContext(w)
    ctx.Request = &http.Request{
        URL:    &url.URL{},
        Header: make(http.Header),
    }
    MockJsonPost(ctx, param)
    MasterController(ctx)
    time.Sleep(3 * time.Second)
    fmt.Println("response body", string(w.Body.Bytes()))
    fmt.Println("status", w.Code)
    // var activity *activity.Activity
    activity, err := activityController.GetLastActivity(nil)
    //tx.Raw("select * from activity order by id desc limit 1").Find(&activity)
    if err != nil {
        fmt.Println("No activity found")
    }
    activityJson, err := activity.ToJsonTest()
    if err != nil {
        fmt.Println("error converting in json")
    }
    fmt.Printf("response activity %+v", string(activityJson))
    assert.EqualValues(t, string(expected[i]), string(activityJson))
}
func MockJsonPost(c *gin.Context, content interface{}) {
    c.Request.Method = "POST" // or PUT
    c.Request.Header.Set("Content-Type", "application/json")
    jsonbytes, err := json.Marshal(content)
    if err != nil {
        panic(err)
    }
    // the request body must be an io.ReadCloser;
    // the bytes buffer doesn't implement io.Closer,
    // so you wrap it in a no-op closer
    c.Request.Body = io.NopCloser(bytes.NewBuffer(jsonbytes))
}

DynamoDB list all backups using AWS GoLang SDK

Based on the example given in the link below, "API Operation Pagination without Callbacks":
https://aws.amazon.com/blogs/developer/context-pattern-added-to-the-aws-sdk-for-go/
I am trying to list all the backups in DynamoDB. But it seems like pagination is not working: it only retrieves the first page and never goes on to the next one.
package main

import (
    "context"
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/request"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
    sess, sessErr := session.NewSession()
    if sessErr != nil {
        fmt.Println(sessErr)
        fmt.Println("Could not initialize session, returning...")
        return
    }
    // Create DynamoDB client
    dynamodbSvc := dynamodb.New(sess)
    params := dynamodb.ListBackupsInput{}
    ctx := context.Background()
    p := request.Pagination{
        NewRequest: func() (*request.Request, error) {
            req, _ := dynamodbSvc.ListBackupsRequest(&params)
            req.SetContext(ctx)
            return req, nil
        },
    }
    for p.Next() {
        page := p.Page().(*dynamodb.ListBackupsOutput)
        fmt.Println("Received", len(page.BackupSummaries), "objects in page")
        for _, obj := range page.BackupSummaries {
            fmt.Println(aws.StringValue(obj.BackupName))
        }
    }
    //return p.Err()
} //end of main
It's a bit late, but I'll put it here in case it helps somebody.
Example:
var exclusiveStartARN *string
var backups []*dynamodb.BackupSummary
for {
    backup, err := svc.ListBackups(&dynamodb.ListBackupsInput{
        ExclusiveStartBackupArn: exclusiveStartARN,
    })
    if err != nil {
        fmt.Println(err)
        os.Exit(1)
    }
    backups = append(backups, backup.BackupSummaries...)
    if backup.LastEvaluatedBackupArn != nil {
        exclusiveStartARN = backup.LastEvaluatedBackupArn
        // max 5 calls a second, so we don't hit the limit
        time.Sleep(200 * time.Millisecond)
        continue
    }
    break
}
fmt.Println(len(backups))
Explanation:
Pagination is done via ExclusiveStartBackupArn in the ListBackupsInput. The ListBackupsOutput contains LastEvaluatedBackupArn if there are more pages, or nil if it's the last/only page.
It could also be that you're hitting the API a bit hard with your usage:
you can call ListBackups a maximum of 5 times per second.
What is the value of p.HasNextPage() in your p.Next() loop?
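One way to check that, sketched from the question's own request.Pagination setup (the session and params wiring is assumed to be otherwise correct): log p.HasNextPage() inside the loop and inspect p.Err() afterwards, since an error such as throttling stops the paginator silently and only shows up through Err().

package main

import (
    "context"
    "fmt"

    "github.com/aws/aws-sdk-go/aws/request"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
    sess := session.Must(session.NewSession())
    svc := dynamodb.New(sess)

    params := dynamodb.ListBackupsInput{}
    p := request.Pagination{
        NewRequest: func() (*request.Request, error) {
            req, _ := svc.ListBackupsRequest(&params)
            req.SetContext(context.Background())
            return req, nil
        },
    }

    for p.Next() {
        page := p.Page().(*dynamodb.ListBackupsOutput)
        // log whether the paginator believes another page exists
        fmt.Println("got", len(page.BackupSummaries), "backups; more pages?", p.HasNextPage())
    }
    // an error (throttling, credentials, ...) ends the loop without output;
    // it only becomes visible here
    if err := p.Err(); err != nil {
        fmt.Println("pagination error:", err)
    }
}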

GoLang FanIn function not working

So I'm trying to write a web crawler using Rob Pike's fan-in function.
This is my code:
package main

import (
    "encoding/json"
    "fmt"
    "io/ioutil"
    "net/http"
)

func main() {
    fanIn(getDuckDuckGo("food"), getGitHub("defunkt"))
}

type DuckDuckGoResponse struct {
    RelatedTopics []struct {
        Result   string `json:"Result"`
        FirstUrl string `json:"FirstURL"`
        Text     string `json:"Text"`
    } `json:"RelatedTopics"`
}

type GitHubResponse struct {
    Login string `json:"login"`
    Email string `json:"email"`
    Name  string `json:"name"`
}

func fanIn(input1 <-chan DuckDuckGoResponse, input2 <-chan GitHubResponse) <-chan string {
    c := make(chan string)
    go func() {
        for {
            select {
            case s := <-input1:
                fmt.Println(s)
            case s := <-input2:
                fmt.Println(s)
            }
        }
    }()
    return c
}

func getDuckDuckGo(k string) <-chan DuckDuckGoResponse {
    resp, err := http.Get("http://api.duckduckgo.com/?q=" + k + "&format=json&pretty=1")
    if err != nil {
        return nil
    }
    c := make(chan DuckDuckGoResponse)
    var duckDuckParsed DuckDuckGoResponse
    jsonDataFromHttp, jsonErr := ioutil.ReadAll(resp.Body)
    if jsonErr != nil {
        fmt.Println("Json error!")
    }
    defer resp.Body.Close()
    if err := json.Unmarshal(jsonDataFromHttp, &duckDuckParsed); err != nil {
        panic(err)
    }
    return c
}

func getGitHub(k string) <-chan GitHubResponse {
    resp, err := http.Get("https://api.github.com/users/?q=" + k)
    if err != nil {
        return nil
    }
    c := make(chan GitHubResponse)
    var githubParsed GitHubResponse
    jsonDataFromHttp, jsonErr := ioutil.ReadAll(resp.Body)
    if jsonErr != nil {
        fmt.Println("Json error!")
    }
    defer resp.Body.Close()
    if err := json.Unmarshal(jsonDataFromHttp, &githubParsed); err != nil {
        panic(err)
    }
    return c
}
I run this program, and nothing prints.
Why?
Thanks
At first glance, the fanIn function returns a channel that is not being read from in your main loop. So yes, you are invoking the fanIn function which returns a channel, but there is nothing reading off of that channel. For a channel to be useful there needs to be a consumer consuming from the channel while on the other end there needs to be a producer producing on that channel. In other words, sending on a channel can't make progress unless someone on the other end is ready to receive on it.
Next, your getGitHub and getDuckDuckGo return channels, but they don't actually send anything on those channels that they return. Also, what you really need is a way to invoke those functions, have them return a channel and still execute your work. You need to use additional goroutines in order be able to have the http.Get calls do their work.
Lastly, your fanIn function also creates a channel and returns it, however it doesn't actually "fan-in" the results from input1 or input2. Since the fanIn returns a channel of type string, you'll need to write a string into them which could be a field off of DuckDuckGoResponse and GitHubResponse.
I urge you to look at this streamlined example of what you are trying to accomplish: https://talks.golang.org/2012/go-docs/faninboring.go
One last observation: you are checking jsonErr != nil and printing the error, but you probably want to return nil there as well to prevent the code from continuing.
I hope this gives you just enough insight to get your code working. Good luck!
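To make those points concrete, here is a minimal, self-contained sketch of the pattern the answer describes. The producer function is a stand-in introduced here for getDuckDuckGo/getGitHub; it sleeps instead of doing the HTTP and JSON work, but the channel mechanics are the ones that matter: each producer does its work inside a goroutine and sends a string result, fanIn forwards both inputs onto one output channel, and main actually consumes that channel.

package main

import (
    "fmt"
    "time"
)

// producer stands in for getDuckDuckGo/getGitHub: it returns a channel
// immediately and does its (simulated) work in a goroutine, sending the
// result on the channel and closing it when done.
func producer(name string, delay time.Duration) <-chan string {
    c := make(chan string)
    go func() {
        defer close(c)
        time.Sleep(delay) // stands in for the http.Get + json.Unmarshal work
        c <- name + ": result"
    }()
    return c
}

// fanIn forwards values from both inputs onto a single output channel
// and closes the output once both inputs are exhausted.
func fanIn(input1, input2 <-chan string) <-chan string {
    out := make(chan string)
    go func() {
        defer close(out)
        for input1 != nil || input2 != nil {
            select {
            case s, ok := <-input1:
                if !ok {
                    input1 = nil // a nil channel blocks forever, disabling this case
                    continue
                }
                out <- s
            case s, ok := <-input2:
                if !ok {
                    input2 = nil
                    continue
                }
                out <- s
            }
        }
    }()
    return out
}

func main() {
    duck := producer("duckduckgo", 100*time.Millisecond)
    git := producer("github", 50*time.Millisecond)
    // Consume the fanned-in channel; without this loop nothing would print.
    for s := range fanIn(duck, git) {
        fmt.Println(s)
    }
}

With those delays the github line should come out first, and the program exits cleanly because every channel is eventually closed.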

Go webcrawler hangs after checking about 2000 urls

I have a program to check whether keywords are on a web page. But after checking 1000-3000 urls, it hangs. There is no output, it does not exit, and the number of tcp connections is zero. I don't know why there are no new connections.
Would you give me some advice how to debug it?
type requestReturn struct {
    url    string
    status bool
}

var timeout = time.Duration(800 * time.Millisecond)

func checkUrls(urls []string, kws string, threadLimit int) []string {
    limitChan := make(chan int, threadLimit)
    ok := make(chan requestReturn, 1)
    var result []string
    i := 0
    for ; i < threadLimit; i++ {
        go func(u string) {
            request(u, limitChan, ok, kws)
        }(urls[i])
    }
    for o := range ok {
        if o.status {
            result = append(result, o.url)
            log.Printf("success %s,remain %d", o.url, len(urls)-i)
        } else {
            log.Printf("fail %s,remain %d", o.url, len(urls)-i)
        }
        if i < len(urls) {
            go func(u string) {
                request(u, limitChan, ok, kws)
            }(urls[i])
            i++
        }
    }
    close(limitChan)
    return result
}

func dialTimeout(network, addr string) (net.Conn, error) {
    return net.DialTimeout(network, addr, timeout)
}

func request(url string, threadLimit chan int, ok chan requestReturn, kws string) {
    threadLimit <- 1
    log.Printf("%s, start...", url)
    //startTime := time.Now().UnixNano()
    rr := requestReturn{url: url}
    transport := http.Transport{
        Dial:              dialTimeout,
        DisableKeepAlives: true,
    }
    client := http.Client{
        Transport: &transport,
        Timeout:   time.Duration(15 * time.Second),
    }
    resp, e := client.Get(url)
    if e != nil {
        log.Printf("%q", e)
        rr.status = false
        return
    }
    if resp.StatusCode == 200 {
        body, err := ioutil.ReadAll(resp.Body)
        if err != nil {
            log.Printf("%q", err)
            rr.status = false
            return
        }
        content := bytes.NewBuffer(body).String()
        matched, err1 := regexp.MatchString(kws, content)
        if err1 != nil {
            log.Printf("%q", err1)
            rr.status = false
        } else if matched {
            rr.status = true
            log.Println(rr.url)
        } else {
            rr.status = false
        }
    } else {
        rr.status = false
    }
    defer (func() {
        resp.Body.Close()
        ok <- rr
        //processed := float32(time.Now().UnixNano()-startTime) / 1e9
        //log.Printf("%s, status:%t,time:%.3fs", rr.url, rr.status, processed)
        <-threadLimit
    })()
}
You seem to be using two forms of concurrency control in this code, and both have problems.
You've got limitChan, which looks like it is being used as a semaphore (request sends a value at its start, and receives a value in a defer in that function). But checkUrls is also trying to make sure it only has threadLimit goroutines running at once (by spawning that number first up, and only spawning more when one reports its results on the ok channel). Only one of these should be necessary to limit the concurrency.
Both methods fail due to the way the defer is set up in request. There are a number of return statements that occur before defer, so it is possible for the function to complete without sending the result to the ok channel, and without freeing up its slot in limitChan. After a sufficient number of errors, checkUrls will stop spawning new goroutines and you'll see your hang.
The fix is to place the defer statement before any of the return statements so you know it will always be run. Something like this:
func request(url string, threadLimit chan int, ok chan requestReturn, kws string) {
    threadLimit <- 1
    rr := requestReturn{url: url}
    var resp *http.Response
    defer func() {
        if resp != nil {
            resp.Body.Close()
        }
        ok <- rr
        <-threadLimit
    }()
    ...
}

Why does the function return early?

I've just started learning go, and have been working through the tour. The last exercise is to edit a web crawler to crawl in parallel and without repeats.
Here is the link to the exercise: http://tour.golang.org/#70
Here is the code. I only changed the crawl and the main function. So I'll just post those to keep it neat.
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
var used = make(map[string]bool)
var urlchan = make(chan string)

func Crawl(url string, depth int, fetcher Fetcher) {
    // TODO: Fetch URLs in parallel.
    // Done: Don't fetch the same URL twice.
    // This implementation doesn't do either:
    done := make(chan bool)
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    go func() {
        for _, i := range urls {
            urlchan <- i
        }
        done <- true
    }()
    for u := range urlchan {
        if used[u] == false {
            used[u] = true
            go Crawl(u, depth-1, fetcher)
        }
        if <-done == true {
            break
        }
    }
    return
}

func main() {
    used["http://golang.org/"] = true
    Crawl("http://golang.org/", 4, fetcher)
}
The problem is that when I run the program the crawler stops after printing
not found: http://golang.org/cmd/
This only happens when I try to make the program run in parallel. If I have it run linearly then all the urls are found correctly.
Note: If I am not doing this right (parallelism I mean) then I apologise.
Be careful with goroutines:
when the main goroutine, i.e. the main() func, returns, all other goroutines are killed immediately.
Your Crawl() looks recursive, but because the recursive calls are launched with go, it returns immediately instead of waiting for the other Crawl() goroutines to finish. And once the first Crawl(), the one called by main(), returns, main() regards its mission as fulfilled and exits.
What you could do is let main() wait until the last Crawl() returns. The sync package, or a chan, would help.
You could take a look at my solution to this exercise, which I wrote months ago:
var store map[string]bool

func Krawl(url string, fetcher Fetcher, Urls chan []string) {
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
    } else {
        fmt.Printf("found: %s %q\n", url, body)
    }
    Urls <- urls
}

func Crawl(url string, depth int, fetcher Fetcher) {
    Urls := make(chan []string)
    go Krawl(url, fetcher, Urls)
    band := 1
    store[url] = true // init for level 0 done
    for i := 0; i < depth; i++ {
        for band > 0 {
            band--
            next := <-Urls
            for _, url := range next {
                if _, done := store[url]; !done {
                    store[url] = true
                    band++
                    go Krawl(url, fetcher, Urls)
                }
            }
        }
    }
    return
}

func main() {
    store = make(map[string]bool)
    Crawl("http://golang.org/", 4, fetcher)
}