In the previous exercise I felt my absence of a formal CompSci background with the introduction of Binary Search Trees, and now I'm conscious of it again learning about mutexes. I'd heard of them before, mostly when Oracle performance folk were talking about wait types. TIL it stands for *mutual exclusion*!
> What if we … want to make sure only one goroutine can access a variable at a time to avoid conflicts?
>
> This concept is called *mutual exclusion*, and the conventional name for the data structure that provides it is *mutex*.
## Exercise: Web Crawler
This was quite a fun one once I wrapped my head around it. It gave a healthy dose of copy & paste material in the form of the previous example, which I used to implement the first requirement.
### Don't fetch the same URL twice
I created a `URLs` struct to hold a map of URLs and a boolean of whether each has been crawled or not, and included a mutex so that it can be read and updated safely during concurrent execution:
```go
type URLs struct {
	c   map[string]bool
	mux sync.Mutex
}

var u URLs = URLs{c: make(map[string]bool)}
```
The `URLs` type implements two methods: one to check if a given URL has been crawled, and the other to mark it as such.
```go
// Pointer receiver so the mutex itself is locked;
// a value receiver would lock a copy, protecting nothing
func (u *URLs) IsCrawled(url string) bool {
	fmt.Printf("\n👀 Checking if %v has been crawled…", url)
	u.mux.Lock()
	defer u.mux.Unlock()
	if _, ok := u.c[url]; !ok {
		fmt.Printf("…it hasn't\t")
		return false
	}
	fmt.Printf("…it has\t")
	return true
}
```
```go
func (u *URLs) Crawled(url string) {
	u.mux.Lock()
	u.c[url] = true
	u.mux.Unlock()
}
```
To the main `Crawl` function I then added calls to these methods and a conditional return:
```go
// Check if the URL has been crawled already
if u.IsCrawled(url) {
	return
}
fmt.Printf("\n➡️ Crawling %v", url)
body, urls, err := fetcher.Fetch(url)
// Mark the URL as crawled (assumes that if there's an error you don't want to retry it)
u.Crawled(url)
```
As the comment notes, a URL gets marked as crawled once fetched, regardless of error status. If I were feeling adventurous I guess I could implement some kind of retry logic with incremental backoff…but I'm keeping it simple for now :)
### Fetch URLs in parallel
This one I assumed could be done by simply using a goroutine for the nested `Crawl` calls. What it actually did was just fetch the first URL and exit:
```
-> Checking if https://golang.org/ has been crawled……it hasn't
found: https://golang.org/ "The Go Programming Language"
```
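The reason, for the record: a Go program exits as soon as `main` returns, without waiting for any goroutines that are still running. A stripped-down sketch of the failure mode, with a hypothetical `fetch` standing in for the real crawler call:

```go
package main

import (
	"fmt"
	"time"
)

// fetch is a hypothetical stand-in for a slow network call.
func fetch(url string) {
	time.Sleep(10 * time.Millisecond)
	fmt.Println("fetched", url)
}

func main() {
	go fetch("https://golang.org/pkg/") // fired off, but...
	fmt.Println("main returning")
	// ...main returns here and the process exits, so the
	// goroutine never gets to print its "fetched" line
}
```

So the crawler needs some way for `main` to wait until all the fetching goroutines are finished.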
Off to Google we went and found this answer on StackOverflow which showed the use of WaitGroups (nice example here). I ripped this off shamelessly into my code and it almost worked…
```
👀 Checking if https://golang.org/ has been crawled……it hasn't
➡️ Crawling https://golang.org/
	->✅ found: https://golang.org/ "The Go Programming Language"
👀 Checking if https://golang.org/pkg/ has been crawled……it hasn't
➡️ Crawling https://golang.org/pkg/
	->✅ found: https://golang.org/pkg/ "Packages"
👀 Checking if https://golang.org/ has been crawled……it has
👀 Checking if https://golang.org/cmd/ has been crawled……it hasn't
➡️ Crawling https://golang.org/cmd/
	->⚠️ not found: https://golang.org/cmd/
👀 Checking if https://golang.org/pkg/fmt/ has been crawled……it hasn't
➡️ Crawling https://golang.org/pkg/fmt/
	->✅ found: https://golang.org/pkg/fmt/ "Package fmt"
👀 Checking if https://golang.org/ has been crawled……it has
👀 Checking if https://golang.org/pkg/ has been crawled……it has
👀 Checking if https://golang.org/pkg/os/ has been crawled……it hasn't
➡️ Crawling https://golang.org/pkg/os/
	->✅ found: https://golang.org/pkg/os/ "Package os"
👀 Checking if https://golang.org/ has been crawled……it has
👀 Checking if https://golang.org/pkg/ has been crawled……it has
👀 Checking if https://golang.org/cmd/ has been crawled……it has
```
but then threw a panic:
```
panic: sync: negative WaitGroup counter

goroutine 1 [running]:
sync.(*WaitGroup).Add(0xc0000a4010, 0xffffffffffffffff)
	/usr/local/go/src/sync/waitgroup.go:74 +0x1ec
sync.(*WaitGroup).Done(0xc0000a4010)
	/usr/local/go/src/sync/waitgroup.go:99 +0x34
main.Crawl(0x1100e8c, 0x13, 0x4, 0x110fb60, 0xc0000801b0, 0xc0000a4010)
	/Users/rmoff/go/src/webcrawler/webcrawler.go:68 +0x676
main.main()
	/Users/rmoff/go/src/webcrawler/webcrawler.go:73 +0x98
```
A bit of Googling showed that `panic: sync: negative WaitGroup counter` comes about (as the error actually suggests) because `Done` had been called more times than `Add`, taking the WaitGroup's counter below zero.
This happened because every execution of `Crawl` includes `defer wg.Done()`, but the corresponding `wg.Add(1)` was only called for the nested calls to `Crawl` and not the initial invocation from `main()`. Adding it to `main()` then made everything work just great:
```go
func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
	defer wg.Done()
	if depth <= 0 {
		return
	}
	// Check if the URL has been crawled already
	if u.IsCrawled(url) {
		return
	}
	fmt.Printf("\n➡️ Crawling %v", url)
	body, urls, err := fetcher.Fetch(url)
	// Mark the URL as crawled (assumes that if there's an error you don't want to retry it)
	u.Crawled(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("\n\t->✅ found: %s %q\n", url, body)
	for _, z := range urls {
		wg.Add(1)
		go Crawl(z, depth-1, fetcher, wg)
	}
}
```
```go
func main() {
	wg := &sync.WaitGroup{}
	wg.Add(1)
	Crawl("https://golang.org/", 4, fetcher, wg)
	wg.Wait()
}
```