
April 22, 2007

Categorizing Documents using Ruby

This is the post with links to resources and materials that I promised at the end of my talk at Gotham Ruby Conf. First, I need to attribute the images I used in my slides to the soon-to-be-released book An Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It's a great resource for anyone looking to learn the basics of classification, clustering, latent semantic indexing, and supervised and unsupervised machine learning as they pertain to documents. It also has a bunch of good stuff about search and building an index.

I've put up a zip of the code I referenced in my talk. The four pieces of interest are the Document, FeatureSelector, FeatureExtractor, and NaiveBayes classes. Sometime in the next month I'm going to make it a little more generic and package it up as a gem. For now, the code is specific to a project I did for my search engine technologies course, in which I had to produce output that works with SVM Light.
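For readers who want a feel for the technique before opening the zip, here is a minimal sketch of a multinomial naive Bayes text classifier with add-one smoothing. It is not the NaiveBayes class from the zip; the class name TinyNaiveBayes, its methods, and the simple tokenizer are all made up for illustration, and it assumes a reasonably recent Ruby (2.4 or later, for Array#sum).

```ruby
# Illustrative sketch only: TinyNaiveBayes is a hypothetical class,
# not the NaiveBayes class shipped in the zip.
class TinyNaiveBayes
  def initialize
    # category => (term => number of occurrences in that category's training text)
    @term_counts = Hash.new { |hash, key| hash[key] = Hash.new(0) }
    # category => number of training documents
    @doc_counts = Hash.new(0)
    # every term seen during training
    @vocabulary = {}
  end

  # Record one training document for the given category.
  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each do |term|
      @term_counts[category][term] += 1
      @vocabulary[term] = true
    end
  end

  # Return the category with the highest log-probability score.
  def classify(text)
    terms = tokenize(text)
    total_docs = @doc_counts.values.sum
    @doc_counts.keys.max_by { |category| score(category, terms, total_docs) }
  end

  private

  # log P(category) + sum of log P(term | category), with add-one smoothing.
  # Terms never seen in training are skipped.
  def score(category, terms, total_docs)
    counts = @term_counts[category]
    total_terms = counts.values.sum
    log_prior = Math.log(@doc_counts[category].to_f / total_docs)
    terms.reduce(log_prior) do |sum, term|
      next sum unless @vocabulary.key?(term)
      sum + Math.log((counts[term] + 1.0) / (total_terms + @vocabulary.size))
    end
  end

  # Assumed tokenizer: lowercase, split on anything that isn't a letter or digit.
  def tokenize(text)
    text.downcase.scan(/[a-z0-9]+/)
  end
end
```

To try it, call train(category, text) once per labeled document and then classify(text) on something new. Working in log space keeps the product of many small probabilities from underflowing, and the add-one smoothing keeps a term that never appeared in a category from zeroing out that category's score.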

Thanks a bunch to the organizers and Google! Thanks also to the guys at Indaba Music for throwing such a great after-party! I'll post more later, including the outline I created for the talk, but right now I have to get back to recovering from last night's revelry.

Update: I've noticed a few bugs in the code, so consider this a warning if you're planning on using it. I need to check something in both the FeatureExtractor and the FeatureSelector: some of the features being selected are off, and apparently duplicates are showing up in the extracted features. I should be updating it sometime in the next week.


Comments

Thanks for posting the material, Paul. I'm eager to experiment with it. Can you double-check the chi_squared.rb file? I may be missing something, but it looks like it got overwritten with an rdoc file... Thanks!

Hmm, so it did. I just uploaded a fixed zip. I also included the training and test data sets in this one, so you'll be able to see it run. Again, be warned that some of it is rough, and there are a few known bugs that I will address soon.

Cheers,
Paul
