From 23f74073667b79c790c81a6f8efcc1b28ed03acc Mon Sep 17 00:00:00 2001
From: haskal
Date: Sat, 11 Apr 2020 04:22:05 -0400
Subject: [PATCH] Update readme; update debug prints

---
 README.md        | 18 +++++++++++++++---
 private/util.rkt |  5 +++--
 webcrawler       |  2 +-
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 7ba8023..257939a 100644
--- a/README.md
+++ b/README.md
@@ -8,12 +8,24 @@
 
 ## High level approach
 
-todo
+We started by creating robust abstracted HTTP-handling code, which is located in the `smol-http`
+module of this project. The HTTP code implements a subset of HTTP 1.1 which is enough to meet the
+requirements for crawling the target web server. It also uses plain TCP sockets to communicate using
+its HTTP implementation. We used Racket standard library functions to parse and manipulate URLs as
+well as parse HTML (as XML, hopefully it's well-formed!) in order to find the hyperlinks on the page
+as well as the flags. We implemented a high performance Certified Web Scale(tm) crawling scheduler
+with a distributed work queue to allow for very high rate crawling, the crawler on our machines
+takes minutes to complete, and finds all the flags very quickly.
 
 ## Challenges
 
-todo
+The current pandemic situation continues to make this semester difficult. Otherwise, we didn't run
+into any major issues during this project.
 
 ## Testing
 
-todo
+We unit tested the HTTP handling code in smol-http, and used ad-hoc manual testing against the
+target server to test the complete crawling functionality.
+
+We have an additional `-d` flag which will print useful debug info during the execution of the
+crawler, which may be helpful for manual testing.
diff --git a/private/util.rkt b/private/util.rkt
index 67e61f3..43b6009 100644
--- a/private/util.rkt
+++ b/private/util.rkt
@@ -29,9 +29,10 @@
 
 ;; ->
 ;; Prints a completion message to the console, only when debug mode is on
-(define (print-complete)
+(define (print-complete total-pages num-flags)
   (when (debug-mode?)
-    (printf "\r\x1b[KCrawl complete\n")))
+    (printf "\r\x1b[KCrawl complete: ~a pages crawled, ~a flags found\n"
+            total-pages num-flags)))
 
 ;; Str ->
 ;; Prints a flag
diff --git a/webcrawler b/webcrawler
index bfcb466..26dc833 100755
--- a/webcrawler
+++ b/webcrawler
@@ -157,7 +157,7 @@
                     (set-count completed) (unbox num-flags))
     (loop)))
 
-  (print-complete)
+  (print-complete (set-count completed) (unbox num-flags))
   ;; send all workers the shutdown message and wait
   (for ([thd (in-vector worker-threads)])
     (thread-send thd #f)
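For reviewers who want to try the new `print-complete` signature outside the crawler, the following is a minimal standalone sketch, not the project's actual module. It assumes `debug-mode?` can be modeled as a Racket parameter (the real definition in `private/util.rkt` is not part of this patch), and the `completed` set and `num-flags` box hold made-up sample values that only mirror the names used at the call site in `webcrawler`.

```racket
#lang racket/base
(require racket/set)

;; ASSUMPTION: hypothetical stand-in for the project's debug flag.
;; The real `debug-mode?` is defined elsewhere in private/util.rkt.
(define debug-mode? (make-parameter #f))

;; Nat Nat ->
;; Prints a completion message to the console, only when debug mode is on.
;; "\r\x1b[K" returns to the start of the line and clears it, so the
;; message overwrites any in-place progress line printed before it.
(define (print-complete total-pages num-flags)
  (when (debug-mode?)
    (printf "\r\x1b[KCrawl complete: ~a pages crawled, ~a flags found\n"
            total-pages num-flags)))

;; ASSUMPTION: sample state for illustration only; the crawler's real
;; `completed` and `num-flags` are built up while crawling.
(define completed (set "/a" "/b" "/c"))
(define num-flags (box 2))

;; Same call shape as the updated line in `webcrawler`.
(parameterize ([debug-mode? #t])
  (print-complete (set-count completed) (unbox num-flags)))
```

With the parameter left at its default of `#f`, the call prints nothing, which matches the "only when debug mode is on" contract in the comment above the function.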