2020-04-11 08:08:13 +00:00
__ __ __
__/ // /_/ /___ _____ ____ _ __ ___ ____ __
/_ _ __/ / __ `/ __ \/ __ `/ / / / / | /| / / / / /
/_ _ __/ / /_/ / / / / /_/ / / /_/ /| |/ |/ / /_/ /
/_//_/ /_/\__,_/_/ /_/\__, / \__,_/ |__/|__/\__,_/
/____/

## High-level approach

We started by building robust, abstracted HTTP-handling code, which lives in the `smol-http` module
of this project. It implements the subset of HTTP/1.1 needed to crawl the target web server and
communicates over plain TCP sockets. We used Racket standard library functions to parse and
manipulate URLs and to parse HTML (as XML, hopefully it's well-formed!) in order to find the
hyperlinks and the flags on each page. We implemented a high-performance Certified Web Scale(tm)
crawling scheduler with a distributed work queue to allow for very high-rate crawling. On our
machines the crawler takes minutes to complete and finds all the flags very quickly.
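
As a rough illustration of the link-finding and work-queue ideas above, here is a minimal sketch. It is written in Python purely for illustration and is not the project's Racket code. The in-memory `PAGES` table, the `extract` and `crawl` helpers, and the `secret_flag` markup are all hypothetical stand-ins, and the queue here is a simple single-process one rather than a distributed scheduler.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit
import xml.etree.ElementTree as ET

# Hypothetical in-memory "site" standing in for the target server.
PAGES = {
    "http://example.test/": (
        '<html><body><a href="/a">A</a><a href="/b">B</a></body></html>'
    ),
    "http://example.test/a": (
        '<html><body><h2 class="secret_flag">FLAG: abc123</h2>'
        '<a href="/b">B</a><a href="http://other.test/x">off-site</a>'
        "</body></html>"
    ),
    "http://example.test/b": '<html><body><a href="/">home</a></body></html>',
}


def extract(base_url, html):
    """Parse a page as XML and return (absolute links, flag texts)."""
    root = ET.fromstring(html)  # raises ParseError if the page is not well-formed
    links = [urljoin(base_url, a.get("href")) for a in root.iter("a") if a.get("href")]
    # Assumption: flags live in <h2 class="secret_flag"> elements.
    flags = [h.text for h in root.iter("h2") if h.get("class") == "secret_flag"]
    return links, flags


def crawl(start_url, fetch):
    """Breadth-first crawl of one host; `fetch` maps a URL to its HTML text."""
    host = urlsplit(start_url).netloc
    queue = deque([start_url])  # work queue of URLs still to visit
    seen = {start_url}          # every URL ever enqueued, to avoid revisits
    flags = []
    while queue:
        url = queue.popleft()
        links, found = extract(url, fetch(url))
        flags.extend(found)
        for link in links:
            # Stay on the starting host and skip anything already queued.
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen, flags


visited, flags = crawl("http://example.test/", PAGES.__getitem__)
```

Passing `fetch` in as a parameter keeps the scheduling logic separate from the HTTP layer, so the same loop could be driven by a real socket-based client or by a test fixture like the one here.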
## Challenges

The current pandemic situation continues to make this semester difficult. Otherwise, we didn't run
into any major issues during this project.
## Testing

We unit tested the HTTP handling code in `smol-http` and used ad hoc manual testing against the
target server to test the complete crawling functionality.

An additional `-d` flag prints useful debug info while the crawler runs, which may be helpful for
manual testing.
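
To give a sense of what unit tests for HTTP handling code can look like, here is a small sketch. It is in Python for illustration only and does not reproduce the `smol-http` API; `build_request`, `parse_response`, and the test names are hypothetical, and the parser assumes the whole response has already been read from the socket.

```python
def build_request(host, path):
    """Serialize a minimal HTTP/1.1 GET request as bytes."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode("ascii")


def parse_response(raw):
    """Split a complete raw response into (status code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status = int(lines[0].split(" ", 2)[1])  # "HTTP/1.1 302 Found" -> 302
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive; a repeated header overwrites the
        # earlier value in this simplified sketch.
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


def test_build_request():
    expected = b"GET / HTTP/1.1\r\nHost: example.test\r\nConnection: close\r\n\r\n"
    assert build_request("example.test", "/") == expected


def test_parse_redirect():
    raw = b"HTTP/1.1 302 Found\r\nLocation: /login\r\nContent-Length: 0\r\n\r\n"
    status, headers, body = parse_response(raw)
    assert status == 302
    assert headers["location"] == "/login"
    assert body == b""


def test_parse_body():
    raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
    status, headers, body = parse_response(raw)
    assert (status, body) == (200, b"<html></html>")


test_build_request()
test_parse_redirect()
test_parse_body()
```

Because these tests work on byte strings rather than live sockets, they run without the target server, which leaves manual testing to cover only the end-to-end crawl.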