I recently did a little project where I tried to train a Naive Bayes classifier automatically using data from the Technorati API and blog feeds. The specific classification I was trying to make was whether or not a blog post was related to the Ruby programming community.
The basic approach I took was the following.
- Query the Technorati API for a list of blog posts tagged 'ruby'.
- Get the feeds of those blogs.
- Put all the articles from the blogs into a training set with posts tagged 'ruby' in the ruby class and the other posts in the 'not ruby' class. I then changed the training set so there were an equal number of posts in each class.
- Convert the posts into a vector representation using a basic bag of words approach.
- Run 10-fold cross-validation: split the data set into 10 equal chunks, then iterate 10 times, each time holding out a different chunk (10%) as the test set and applying the following two steps to the remaining 90%.
- Perform feature selection using chi-squared and throw out any features that occur only once. I kept features with a chi-squared score above 3.
- Train the classifier and run the test on the held out 10%.
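The feature selection and fold splitting steps above can be sketched roughly as follows. This is a minimal illustration and not my actual code: the method names, the `[label, tokens]` document shape, and the choice to read "occurs only once" as "appears in only one document" are all assumptions made for the example.

```ruby
# Chi-squared score for one term from a 2x2 contingency table.
#   a = ruby posts containing the term    b = other posts containing it
#   c = ruby posts without the term       d = other posts without it
def chi_squared(a, b, c, d)
  n = a + b + c + d
  denominator = (a + b) * (c + d) * (a + c) * (b + d)
  return 0.0 if denominator.zero?
  n * ((a * d) - (b * c))**2 / denominator.to_f
end

# docs: array of [label, tokens] pairs; label is :ruby or :not_ruby (assumed shape).
# Keeps terms that appear in at least two documents and score above the threshold.
def select_features(docs, threshold = 3.0)
  ruby_total = docs.count { |label, _| label == :ruby }
  other_total = docs.size - ruby_total
  ruby_df = Hash.new(0)   # document frequency within the ruby class
  other_df = Hash.new(0)  # document frequency within the other class
  docs.each do |label, tokens|
    counts = label == :ruby ? ruby_df : other_df
    tokens.uniq.each { |term| counts[term] += 1 }
  end
  (ruby_df.keys | other_df.keys).select do |term|
    a = ruby_df[term]
    b = other_df[term]
    next false if a + b < 2 # assumption: "occurs only once" means one document
    chi_squared(a, b, ruby_total - a, other_total - b) > threshold
  end
end

# Yields [training, held_out] once per fold; documents are dealt round-robin.
def each_fold(docs, k = 10)
  folds = Array.new(k) { [] }
  docs.each_with_index { |doc, i| folds[i % k] << doc }
  folds.each_with_index do |held_out, i|
    training = (folds[0...i] + folds[(i + 1)..-1]).flatten(1)
    yield training, held_out
  end
end

# Usage: feature selection runs inside each fold, on the training 90% only.
# each_fold(docs) do |training, held_out|
#   features = select_features(training)
#   # train on `training` restricted to `features`, then score `held_out`
# end
```

Selecting features inside the loop keeps the held-out 10% from influencing which features are kept.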
Here's the output of the test:
total correct: 403
total documents: 530
total accuracy: 76.03%
So those results are a little discouraging at a glance. However, I can think of a few reasons for the poor accuracy.
First is the data set gathered from Technorati. I took a look at some of the articles and noticed that not all of the posts tagged 'ruby' had anything to do with the programming community. Some were tagged because people were selling or talking about the actual rocks. At least one was tagged 'ruby' because it was about a song of that title. I guess my theory that I could automatically train a classifier from Technorati-tagged posts is out the window. Actually, there may be a way to filter the set down to only good data: I could run a clustering algorithm over the posts and keep just the cluster of Ruby programming posts. It's only an idea at this point, and I'm not sure whether it would work.
Second is my naive representation of the posts. I represented them as a simple bag of words: I removed all non-word characters and numbers, stemmed the words, and put them into a vector. The problem with removing all those non-word characters is that in the Ruby programming community they carry significance. In detail, I first converted the post to text via Hpricot's to_plain_text method, then ran the result through the following chain of substitutions:
the_plain_text.gsub(/['@]/, '').gsub(/\W/, ' ').gsub(/[0-9_]/, '').downcase
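To make the effect of that cleanup concrete, here is a small sketch that applies the same chain of substitutions (written with character classes) and counts the resulting words into a hash. The method name `bag_of_words` is made up for the example, and stemming is left out so the sketch has no gem dependencies.

```ruby
# Turns plain text into a term-frequency hash (a sparse bag-of-words vector).
# Assumption: no stemming here, so "gem" and "gems" stay separate terms.
def bag_of_words(plain_text)
  cleaned = plain_text.gsub(/['@]/, '')   # drop apostrophes and @ signs
                      .gsub(/\W/, ' ')    # every other non-word char becomes a space
                      .gsub(/[0-9_]/, '') # delete digits and underscores
                      .downcase
  cleaned.split.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

bag_of_words("It's {:a => 1}")
# The braces, colon, => and the digit are all gone; only "its" and "a" remain.
```

Note that underscores are deleted and not replaced by spaces, so an identifier like RUBY_VERSION collapses into the single term "rubyversion".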
Much of the effort in improving classification goes into better document representation, which means extracting a better set of features. One that may improve things is to have each post include the ratio of Ruby to non-Ruby posts on the blog it comes from. The theory is that blogs that post frequently and predominantly about Ruby will continue to do so, while blogs that are all over the map have a lower probability of any given post being about Ruby. This type of feature could actually be used for classifying blog posts on any topic of interest.
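As a rough sketch of that per-blog ratio feature, assuming each post is a hash with a `:blog` identifier and a list of `:tags` (a hypothetical shape, not what my code uses):

```ruby
# posts: array of hashes like { blog: 'example.com', tags: ['ruby', 'rails'] }
# (hypothetical shape). Returns a hash mapping each blog to the fraction of
# its posts that carry the 'ruby' tag.
def ruby_ratio_by_blog(posts)
  posts.group_by { |post| post[:blog] }.transform_values do |blog_posts|
    ruby_count = blog_posts.count { |post| post[:tags].include?('ruby') }
    ruby_count.to_f / blog_posts.size
  end
end

posts = [
  { blog: 'a', tags: ['ruby'] },
  { blog: 'a', tags: ['music'] },
  { blog: 'b', tags: ['jewelry'] }
]
ruby_ratio_by_blog(posts)
```

One caveat if this were used in training: the ratio for a post should probably be computed from the other posts on its blog, since otherwise the post's own label leaks into its features.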
The other set of features is specific to either programming or Ruby itself. Posts about Ruby programming often contain sections of Ruby code. If I checked for some basic indicators of code before removing punctuation, I could use them as additional features. For example, I could have a feature that counts how many times => occurs in the blog post, the idea being that a post containing it is probably about Ruby because a ton of Ruby code contains that character sequence.
Of course at this point most of this is speculation. The only thing I'm fairly certain of is that I can't train a classifier automatically from Technorati data without further processing. The set of interests and users tagging posts is simply too broad. I have to whittle things down a little.
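A minimal sketch of such a counter, run on the text before the punctuation is stripped. The method name and the `markers` hash are made up for the example, and => is the only marker taken from the idea above.

```ruby
# Counts literal occurrences of each marker in the raw (unstripped) text.
# markers maps a feature name to the character sequence to look for.
def code_marker_counts(raw_text, markers = { 'hash_rocket' => '=>' })
  markers.each_with_object({}) do |(name, marker), counts|
    counts[name] = raw_text.scan(marker).size
  end
end

code_marker_counts('h = {:a => 1, :b => 2}')
```

If the text still contains HTML entities at this point, the sequence would show up as =&gt; and the marker would need to match that form.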
My next steps from here are to perform some of these tests, gemify the chi squared and classification code so other people can play with this too, and see if I can compile a manually marked set to run tests against. All of this will have to wait for a little bit since I have finals to attend to. If you're curious about this stuff or want to see my code just drop me a line.
technorati tags:ruby, machinelearning, classifiers, naivebayes, technoratiapi
Hey Paul, have you looked at Lingpipe? http://www.alias-i.com/lingpipe/
Might have some interesting parallels in helping you build the classifier.
Posted by: Niki Scevak | May 01, 2007 at 09:39 AM
Nick, I heard of LingPipe a little while ago when I met Breck (the president of alias-i) at BarcampNYC2. I haven't had the chance to play around with it yet. I'm doing everything in Ruby so I guess I'll have to give JRuby a shot if I want to try it.
Part of the thing with this effort is that I want to write the tools in Ruby and make them available for the rest of the community. It also gives me a chance to understand the algorithms deeper by writing implementations myself.
At the very least it would be a good idea to use LingPipe for comparison and sanity checking against my own code.
Posted by: Paul Dix | May 01, 2007 at 11:39 AM