I've finally managed to release the initial version of Basset, which is a gem for performing machine learning tasks. At this point it's a bit of a stretch to say it's a general purpose machine learning gem. It consists of a single classifier (Naive Bayes), a feature selector based on chi squared, a generic document representation, and an evaluator for testing out different classification algorithms using cross validation. To install do:
gem install 'basset' --include-dependencies
Before I get to some statistics and code, I'd like to give Bryan Helmkamp a big thanks for refactoring the tests into RSpec and doing a little refactoring on the library itself.
I'm planning on improving Basset over the coming months. Most of my changes will be additional classification algorithms such as a decision tree, a perceptron, and a wrapper around a support vector machine library. I'll be adding some clustering algorithms too. As I put these in, I'll probably beef up the evaluator so I can run comparisons and tests. I also need to write more tests and get better coverage, which I hope to do as I work on the other stuff.
For the moment, I just had the evaluator run my NaiveBayes with feature selection against Lucas Carlson's Classifier gem, using rec.autos and rec.motorcycles from the 20newsgroups dataset. Here are the last few lines of output from Basset's ClassificationEvaluator:
Trained on 17860 documents on 10 cross validation runs.
External classified 1945 of 1980 correctly for 98.23% accurcy.
Executed run in 218.1 seconds.
Basset::NaiveBayes classified 1903 of 1980 correctly for 96.11% accurcy.
Executed run in 51.1 seconds.
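For anyone wondering how the counts in that output fit together, here is a rough sketch of 10-fold cross validation in plain Ruby. It is not Basset's implementation: `cross_validate` and `accuracy_percent` are hypothetical helpers, and the figure of 1984 documents is only inferred from the output (17860 / 10 training documents per run, plus 1980 / 10 testing documents per run). A real run would also shuffle the documents before splitting them.

```ruby
# Illustrative k-fold cross validation (an assumption-laden sketch, not Basset's code).
# For each fold, the block receives a training set and a testing set and must
# return how many testing documents it classified correctly.
def cross_validate(documents, folds = 10)
  fold_size = documents.size / folds   # leftover documents are only ever trained on
  totals = { trained: 0, tested: 0, correct: 0 }
  folds.times do |i|
    testing_set  = documents[i * fold_size, fold_size]
    training_set = documents[0, i * fold_size] + documents[(i + 1) * fold_size..-1]
    totals[:trained] += training_set.size
    totals[:tested]  += testing_set.size
    totals[:correct] += yield(training_set, testing_set)
  end
  totals
end

# Accuracy as a percentage, rounded to two decimal places.
def accuracy_percent(correct, total)
  (100.0 * correct / total).round(2)
end

# Stand-in "documents"; 1984 is inferred from the output, not stated by it.
documents = (1..1984).to_a
totals = cross_validate(documents, 10) { |_training, testing| testing.size }
puts "Trained on #{totals[:trained]} documents, tested on #{totals[:tested]}"
puts "Classifier gem: #{accuracy_percent(1945, 1980)}%"
puts "Basset:         #{accuracy_percent(1903, 1980)}%"
```

The block in the usage lines pretends every testing document was classified correctly, just to exercise the bookkeeping. The accuracy figures are simply the correct count divided by the 1980 documents tested.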
As you can see, the Classifier gem actually performs a little better on accuracy (98.23% versus 96.11%), while Basset runs roughly four times faster (51.1 versus 218.1 seconds). The lower accuracy from Basset is probably due to the lame default document representation, which anyone seriously using the library should override. I've provided an overridden document example in the library. Here's the test code I whipped up real quick to run the two against each other. It shows use of the ClassificationEvaluator.
require 'rubygems'
require 'basset'
require 'classifier'
def class_name(class_dir)
  class_dir.split(".").last
end
class1_dir = "rec.autos/"
class2_dir = "rec.motorcycles/"
class1_name = class_name(class1_dir)
class2_name = class_name(class2_dir)
# docs holds one array of Basset::Document objects per class.
# slice(2, 10000) skips the first two directory entries (normally "." and "..").
docs = [Dir.entries(class1_dir).slice(2, 10000).collect {|n|
  Basset::Document.new(File.open(class1_dir + n).
    readlines.join(" "), class1_name)}]
docs << Dir.entries(class2_dir).slice(2, 10000).collect {|n|
  Basset::Document.new(File.open(class2_dir + n).
    readlines.join(" "), class2_name)}
evaluator = Basset::ClassificationEvaluator.new(docs)
# Here we're specifying the Basset classifiers to run against, the chi-value
# that is the cut-off in feature selection, and a block which the evaluator
# uses to run the external classifier.
evaluator.compare_against_basset_classifiers(
    [Basset::NaiveBayes.new()],
    1.0) do |training_set, testing_set|
  classifier = Classifier::Bayes.new(class1_name, class2_name)
  puts "training Classifier::Bayes on #{training_set.size} documents..."
  training_set.each do |document|
    classifier.train(document.classification, document.text)
  end
  puts "running Classifier::Bayes on #{testing_set.size} documents..."
  number_correctly_classified = 0
  testing_set.each do |document|
    if classifier.classify(document.text).downcase == document.classification
      number_correctly_classified += 1
    end
  end
  # the block's return value is the number of correct classifications
  number_correctly_classified
end
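The 1.0 passed to the evaluator above is the chi-value cut-off for feature selection. As a rough illustration of what such a cut-off means, here is the standard chi-squared statistic for a 2x2 table of document counts, in plain Ruby. This is not Basset's code: `chi_squared` and `select_features` are hypothetical helpers, the counts are made up, and whether Basset keeps features at or strictly above the cut-off is an assumption here.

```ruby
# Chi-squared score for one feature against one class, from a 2x2 table of
# document counts (hypothetical helper, not part of Basset's API):
#   a: documents in the class that contain the feature
#   b: documents outside the class that contain the feature
#   c: documents in the class that lack the feature
#   d: documents outside the class that lack the feature
def chi_squared(a, b, c, d)
  n = a + b + c + d
  denominator = (a + b) * (c + d) * (a + c) * (b + d)
  return 0.0 if denominator.zero?   # an empty row or column carries no information
  n * ((a * d - b * c)**2).to_f / denominator
end

# Keep the features whose score reaches the cut-off (assumed to be inclusive).
def select_features(counts_by_feature, cutoff)
  counts_by_feature.select { |_feature, counts| chi_squared(*counts) >= cutoff }.keys
end

# Made-up counts for two features over 20 documents.
counts = {
  "engine" => [10, 0, 0, 10],  # appears only inside the class: informative
  "the"    => [5, 5, 5, 5]     # spread evenly across classes: uninformative
}
puts select_features(counts, 1.0).inspect
```

A feature that shows up equally often in both classes scores 0 and gets dropped, while one that appears in only one class scores high and is kept. Raising the cut-off keeps fewer, more discriminating features.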