      __ __  __
   __/ // /_/ /___ _____  ____ _   __  ___      ____  __
  /_  _  __/ / __ `/ __ \/ __ `/  / / / / | /| / / / / /
 /_  _  __/ / /_/ / / / / /_/ /  / /_/ /| |/ |/ / /_/ /
  /_//_/ /_/\__,_/_/ /_/\__, /   \__,_/ |__/|__/\__,_/
                       /____/

High-level approach

We started by writing robust, well-abstracted HTTP-handling code, which lives in the smol-http module of this project. It implements the subset of HTTP/1.1 needed to crawl the target web server, communicating over plain TCP sockets. We used Racket standard library functions to parse and manipulate URLs, and to parse each page's HTML (as XML, hopefully it's well-formed!) in order to find both the hyperlinks on the page and the flags. On top of that, we built a high-performance Certified Web Scale(tm) crawling scheduler with a distributed work queue, which supports a very high crawl rate: on our machines the crawler completes in minutes and finds all the flags quickly.
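To make the link-extraction step concrete, here is a minimal sketch (not the project's actual code) of the general idea: parse a fetched page as XML with Racket's xml library and resolve every `<a>` href against the page's own URL with net/url.

```racket
#lang racket
(require xml net/url)

;; Illustrative sketch only, not smol-http's real code. Parses a fetched
;; page as XML (which, as noted above, assumes the markup is well-formed)
;; and returns every <a> href resolved against the page's own URL.
(define (extract-links base-url page-string)
  (define doc (read-xml (open-input-string page-string)))
  (let walk ([node (document-element doc)] [acc '()])
    (cond
      [(element? node)
       (define acc*
         (if (eq? (element-name node) 'a)
             (let ([href (for/first ([att (in-list (element-attributes node))]
                                     #:when (eq? (attribute-name att) 'href))
                           (attribute-value att))])
               (if href
                   (cons (url->string (combine-url/relative base-url href)) acc)
                   acc))
             acc))
       ;; Recurse into the child nodes, threading the accumulator through.
       (for/fold ([links acc*]) ([child (in-list (element-content node))])
         (walk child links))]
      [else acc])))
```

Calling `(extract-links (string->url "http://example.com/") page)` on a fetched page string would yield a list of absolute URL strings ready to feed back into the work queue.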

Challenges

The current pandemic situation continues to make this semester difficult. Otherwise, we didn't run into any major issues during this project.

Testing

We unit tested the HTTP-handling code in smol-http, and used ad-hoc manual testing against the target server to exercise the complete crawling functionality.
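For a flavor of what those unit tests look like, here is a tiny rackunit example; the `parse-status-code` helper below is a hypothetical stand-in for smol-http's real status-line parsing, not its actual API.

```racket
#lang racket
(require rackunit)

;; Hypothetical helper standing in for smol-http's real status-line parsing.
(define (parse-status-code line)
  (string->number (second (string-split line " "))))

(check-equal? (parse-status-code "HTTP/1.1 200 OK") 200)
(check-equal? (parse-status-code "HTTP/1.1 301 Moved Permanently") 301)
(check-equal? (parse-status-code "HTTP/1.1 404 Not Found") 404)
```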

We also support an additional -d flag, which prints useful debug info during the execution of the crawler and can be helpful for manual testing.
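For reference, a -d style switch can be wired up with Racket's racket/cmdline along these lines; this is only a sketch, and the real webcrawler entry point may handle its arguments differently.

```racket
#lang racket

;; Sketch only; the actual webcrawler may parse its arguments differently.
(define debug? (make-parameter #f))

(define rest-args
  (command-line
   #:program "webcrawler"
   #:once-each
   [("-d") "Print useful debug info during the crawl" (debug? #t)]
   #:args args
   args))

(when (debug?)
  (eprintf "debug mode on; remaining args: ~a\n" rest-args))
```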