This is the post with links to resources and materials that I promised at the end of my talk at Gotham Ruby Conf. First, I need to attribute the images I used in my slides to the soon-to-be-released book An Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. It's a great resource for anyone looking to learn the basics of classification, clustering, latent semantic indexing, and supervised and unsupervised machine learning as they pertain to documents. It also covers search and index construction in good detail.
I've put up a zip of the code I referenced in my talk. The four pieces of interest are the Document, FeatureSelector, FeatureExtractor, and NaiveBayes classes. The code is specific to a project I did for my search engine technologies course, in which I had to write the extracted features out in a format SVM Light could read. Sometime in the next month I'm going to make it a little more generic and package it up as a gem.
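For anyone who wants a feel for the idea before opening the zip, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is not the code from the zip: the class name `TinyNaiveBayes` and its `train` and `classify` methods are made up for illustration, and it assumes documents have already been turned into arrays of tokens.

```ruby
# Hypothetical sketch of multinomial naive Bayes; not the NaiveBayes
# class from the zip. Assumes documents arrive as arrays of tokens.
class TinyNaiveBayes
  def initialize
    @doc_counts  = Hash.new(0)                              # label => number of training docs
    @term_counts = Hash.new { |h, k| h[k] = Hash.new(0) }   # label => { term => count }
    @vocabulary  = {}                                       # every term seen in training
  end

  # Record one training document under the given label.
  def train(label, tokens)
    @doc_counts[label] += 1
    tokens.each do |term|
      @term_counts[label][term] += 1
      @vocabulary[term] = true
    end
  end

  # Return the label with the highest log-probability score.
  def classify(tokens)
    total_docs = @doc_counts.values.sum.to_f
    scores = @doc_counts.keys.map do |label|
      total_terms = @term_counts[label].values.sum
      # Start from the log prior for the class.
      score = Math.log(@doc_counts[label] / total_docs)
      tokens.each do |term|
        count = @term_counts[label].fetch(term, 0)
        # Add-one (Laplace) smoothing so unseen terms don't zero out the score.
        score += Math.log((count + 1.0) / (total_terms + @vocabulary.size))
      end
      [label, score]
    end
    scores.max_by { |_, score| score }.first
  end
end

classifier = TinyNaiveBayes.new
classifier.train(:ruby,   %w[ruby gem rails])
classifier.train(:python, %w[python pip django])
puts classifier.classify(%w[gem ruby])
```

Working in log space keeps long documents from underflowing to zero, and the smoothing term is the simplest choice rather than a tuned one.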
Thanks a bunch to the organizers and Google! Thanks also to the guys at Indaba Music for throwing such a great after-party! I'll post more later, including the outline I created for the talk, but right now I have to get back to recovering from last night's revelry.
Update: I've noticed a few bugs in the code, so consider this a warning if you're planning on using it. I need to check something in the FeatureExtractor and something in the FeatureSelector: some of the selected features are off, and duplicates are apparently showing up in the extracted features. I should be updating it sometime in the next week.