I've had a busy week so progress has been slow on CorpusCrawler. However, I did get a few things done so here's my writeup on what I've accomplished so far. It's been quite the adjustment trying to write in Python. I haven't yet learned how to think in Python so things are a little slow right now. I did take advantage of the interactive shell. It was nice to be able to work with and manipulate the data on the fly. I created a few modules to store my code, but found this caused a little problem with the whole "on-the-fly" concept. Every time I made a change to one of the modules to fix a bug or extend the functionality, I had to restart the shell and re-import them to force a recompile. I wonder if there's an easier way to do this. Anyway, let me stay on the topic of CorpusCrawler.
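On the reloading question above: the standard library can re-execute a module that is already imported, so the shell doesn't need to be restarted. This is only a minimal sketch, assuming a current Python 3 interpreter where the call is `importlib.reload`. The helper name `reload_module` is made up for illustration.

```python
import importlib
import sys


def reload_module(name):
    """Re-run the source of an already imported module and return it.

    `reload_module` is a hypothetical convenience wrapper; the real work
    is done by importlib.reload from the standard library.
    """
    module = sys.modules[name]  # raises KeyError if it was never imported
    return importlib.reload(module)
```

In the shell this would be called after saving an edited file, for example `reload_module("crawler")` for a module named `crawler` (a placeholder name). One caveat: names pulled in with `from module import name` keep pointing at the old objects, and instances created before the reload still use the old class definitions.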
The two modules I wrote can be downloaded here. The code is rough since I'm writing this whole thing as a proof of concept project. The code I wrote to provide the numbers and analysis was all done in the interactive shell after I ran through a test crawl.
At this point the only thing this really does is the crawling portion. To kick it off, I give it a starting web site and the number of pages to download. It downloads the raw HTML for each page and parses it looking for URLs to follow. Each URL goes onto the queue of sites to check, but only if it hasn't already been crawled and isn't already in the queue. I used a queue so that the crawl would be breadth first (get the links closest to the source). I also pull out all the text outside of HTML tags and add it to a list of data for the site object. Once the crawl is complete I have everything in memory and I can play around with it a little bit to see how everything turned out.
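The crawl loop described above can be sketched roughly as follows. This is not the code from my two modules, just a compact illustration written against the Python 3 standard library. The names `PageParser`, `fetch` and `crawl` are assumptions, as are the choices to follow only `<a href>` links and to skip anything that isn't http or https.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects absolute link targets and the text found outside of tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            if absolute.startswith(("http://", "https://")):
                self.links.append(absolute)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def fetch(url):
    """Download one page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages, fetch_page=fetch):
    """Breadth-first crawl that stops after max_pages download attempts."""
    queue = deque([start_url])
    seen = {start_url}  # every URL that was ever crawled or queued
    pages = {}          # url -> list of text chunks
    errors = []         # (url, exception) for failed downloads
    while queue and len(pages) + len(errors) < max_pages:
        url = queue.popleft()  # FIFO order is what makes this breadth first
        try:
            html = fetch_page(url)
        except Exception as exc:
            errors.append((url, exc))
            continue
        parser = PageParser(url)
        parser.feed(html)
        pages[url] = parser.text
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages, queue, errors
```

Keeping a single `seen` set for both crawled and queued URLs covers the two checks (already crawled, already in the queue) with one lookup. Passing the download function in as `fetch_page` makes it possible to try the loop against canned HTML without touching the network.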
For my initial run, I told it to go to http://news.google.com and to download 500 pages. I didn't time the operation, but I'd guess it took around 10 minutes. I know this isn't very fast and I'm sure most of the time was spent connecting to sites or downloading the HTML. The processor utilization didn't even come close to maxing out during this whole process. Four of the pages it tried to download produced errors. I'll have to look into those further to figure out what happened. Anyway, here are the initial numbers:
- pages crawled: 496
- unique links in queue: 34063
- unique words: 42176
- occurrences of "the": 17581
- occurrences of "obtuse": 0
As you can see, there's no shortage of sites to crawl. Every link in the queue is unique, so I can be pretty sure that I'm getting new content each time. The numbers on words are really just trash at this point. I need to come up with a way to pull the meaty sections of text out of a page, or to discard the page completely if there's nothing useful there. I'm not sure how to do that yet, but it should be my primary focus for the next few weeks. A few other interesting numbers: the word "bush" shows up 715 times, which I guess is a symptom of crawling mostly news sites. There are 15776 words that show up only once. I'm calling them words, but they could be anything at this point. Here's one example of a so-called word: "=42968d1df378b715". Obviously, I'm going to have to do some work to make this thing more intelligent. Also, I said that I was only worried about English, but I noticed that some of the words pulled back were in other languages. That's another thing I'll have to worry about.
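For reference, numbers like the ones above can be pulled out in the shell along these lines. This is a sketch that assumes the tokens come from lowercasing and splitting on whitespace, which may not match exactly what I typed during the session. The `looks_like_word` check is a made-up first heuristic for spotting junk tokens, not something the crawler does today.

```python
import re
from collections import Counter

# Hypothetical rule: lowercase letters only, with an optional apostrophe part.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_words(chunks):
    """Count whitespace-separated, lowercased tokens across text chunks."""
    counts = Counter()
    for chunk in chunks:
        counts.update(chunk.lower().split())
    return counts


def singletons(counts):
    """Return the tokens that occur exactly once."""
    return [token for token, n in counts.items() if n == 1]


def looks_like_word(token):
    """True if the whole token matches the simple word pattern above."""
    return WORD_RE.fullmatch(token) is not None
```

With `counts = count_words(all_text)`, where `all_text` stands for whatever list of text chunks the crawl produced, the number of unique words is `len(counts)` and a single word's total is `counts["the"]`. A `Counter` returns 0 for tokens it never saw, which is how a word like "obtuse" comes back as 0.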
Other than the intelligence of the program, a large problem I'll have to tackle is performance. I'm currently keeping everything in memory for these test runs. This will have to change if I'm ever going to get this thing to crawl a large number of sites (100,000+). Another thing about performance that worries me is that I may be spending too much time just connecting to different sites. Would the program benefit from multi-threading so that multiple connections are being brought up simultaneously? I can't be sure without running through some timed test runs on a consistent set of data. This means I'd have to set up a series of sites to loop through on a local web server, since that's the only way I can be sure that I'm getting the same sites for every timed test run. I'm going to put that idea on hold for now. In the near term, it's ok if I have to run the program for a 24-hour period to pull back a large enough set of data, as long as I store the data locally to manipulate it later. At this point I just want to pull back enough pages so that I can start figuring out how to pull out the meaningful text.
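To make the multi-threading question a bit more concrete, here is one way the comparison could be set up. It's a sketch only, assuming Python 3's `concurrent.futures`. The names `fetch_all` and `timed` are invented for the example, and `fetch` is whatever function downloads a single URL.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, workers=8):
    """Download a list of urls with a pool of threads.

    Returns a dict mapping each url to its result, or to the exception
    raised while fetching it, so one bad site doesn't stop the batch.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input urls.
        return dict(zip(urls, pool.map(safe_fetch, urls)))


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

Running `timed(fetch_all, urls, fetch, workers=1)` and then the same call with a larger `workers` value against the same fixed set of pages would show whether waiting on connections really is the bottleneck. Threads only help here if most of the time goes to waiting on the network, which is what the low processor utilization seems to suggest.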
Ok, now I have to give myself some homework to get done in the following week. Here it is in bullets:
- get the crawler to store data locally
- give the crawler a resume capability so it can pick up where it left off
- run a test crawl of at least 20,000 pages
I think that's conservative enough to get done by next Wednesday. I'll post all about it then.
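As a starting point for the first two items, the crawl state could be written to disk and read back something like this. It's a rough sketch that assumes the state is just the queue, the set of seen URLs and the extracted text per page, all dumped into one JSON file. The function names and the file layout are placeholders, not a decision.

```python
import json
import os
from collections import deque


def save_state(path, queue, seen, pages):
    """Write the crawl state to path, replacing any earlier snapshot."""
    state = {"queue": list(queue), "seen": sorted(seen), "pages": pages}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    # Swap the finished file into place so an interrupted save
    # never leaves a half-written snapshot behind.
    os.replace(tmp_path, path)


def load_state(path, start_url):
    """Return (queue, seen, pages), starting fresh if no snapshot exists."""
    if not os.path.exists(path):
        return deque([start_url]), {start_url}, {}
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return deque(state["queue"]), set(state["seen"]), state["pages"]
```

Calling `save_state` every so many pages and `load_state` at startup would give the crawler a basic resume capability. A single JSON file probably won't hold up for a 20,000 page crawl, so the page text may need to go into separate files or a small database instead.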