January 20, 2008

ActiveDocument: More than just a document store

Last week I met up with Sebastian, Jake, and others to talk about ActiveDocument and Thrudb. Rick Olson (aka technoweenie) and Ross McFarland (a Thrudb contributor) also joined in #activedocument on freenode. Before I get to that, a few pointers: we've all decided to focus on Rick's ActiveDocument implementation to bring our efforts together, and Sebastian has set up a Google Group for ActiveDocument.

We spent our time asking Jake a bunch of questions about the future of Thrudb and trying to figure out how ActiveDocument could become an alternative data layer for Rails apps (replacing ActiveRecord). The project has many pieces, and together they reach well beyond a simple document store. The major parts include Thrucene for indexing, Thrudb as the document store, Thruqueue as a message queue, and Throxy to proxy requests and provide load balancing between Thrudb and Thrucene instances. If AD is going to replace AR and an RDBMS, we'll have to take advantage of most of these pieces.

Thrudb is still very much in development. There isn't an official release and Jake has stated that it will probably just be in the subversion repo for a while. I believe that Throxy still has some work before it's ready, but it's high on the list. That being the case, it will probably be a little while before ActiveDocument has a release.

However, we still came up with a list of design and implementation questions we're thinking about or concerned by.

  • Support for multiple serialization formats. The Thrudb examples use Thrift as the serialization format; AD supports both that and JSON.
  • How to deal with concurrency, locking, and counters. Thrudb uses atomic writes where the last write wins. If a group is a document with its membership stored inside it, how do we deal with two processes updating the group membership at the same time?
  • Ability to support other document stores or indexes. Jake is separating Thrudb out into its constituent parts so it's conceivable that you could use SimpleDB as the document store and use the rest.
  • Thinking outside the RDBMS box. AR provides great built in finders and relationship functionality. How much of this should we try to replicate and what other things can we include that just wouldn't have been possible in an RDBMS?
  • Speed for writes and updates. Rick posted some metrics on ActiveDocument vs. ActiveRecord. The write speed is slow due to an update to the index each time a record is created or updated. Jake tells me that this problem only gets worse as the index grows in size. He's actively working on this and has recently improved it quite a bit. However, it still highlights that index design is important when thinking about how an AD schema will be set up.
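
To make the concurrency concern concrete, here is a minimal sketch of the lost update that last-write-wins allows, along with one possible remedy: a version check before writing. TinyStore, put_if_version, and StaleWrite are hypothetical names for an in-memory stand-in. None of this is the Thrudb API, and nothing here implies Thrudb offers a versioned write.

```ruby
# Hypothetical in-memory stand-in for a document store; NOT the Thrudb API.
class TinyStore
  class StaleWrite < StandardError; end

  def initialize
    @docs = {} # key => [version, value]
  end

  # Returns [version, value]; a missing key reads as version 0 and nil.
  def get(key)
    @docs.fetch(key, [0, nil])
  end

  # Last write wins: overwrite whatever is there.
  def put(key, value)
    version, _value = get(key)
    @docs[key] = [version + 1, value]
  end

  # Compare-and-set: only write if nobody has written since we read.
  def put_if_version(key, expected_version, value)
    version, _value = get(key)
    unless version == expected_version
      raise StaleWrite, "#{key} changed since version #{expected_version}"
    end
    @docs[key] = [version + 1, value]
  end
end

store = TinyStore.new
store.put("group:1", ["alice"])

# Two processes read the same membership list...
_v_a, members_a = store.get("group:1")
_v_b, members_b = store.get("group:1")

# ...and each writes back its own modified copy. The second write
# silently discards the member added by the first.
store.put("group:1", members_a + ["bob"])
store.put("group:1", members_b + ["carol"])
lost_update = store.get("group:1").last

# With a version check the second writer is told to re-read and retry.
store.put("group:2", ["alice"])
v_a, members_a = store.get("group:2")
v_b, members_b = store.get("group:2")
store.put_if_version("group:2", v_a, members_a + ["bob"])
begin
  store.put_if_version("group:2", v_b, members_b + ["carol"])
rescue TinyStore::StaleWrite
  v_b, members_b = store.get("group:2")
  store.put_if_version("group:2", v_b, members_b + ["carol"])
end
merged = store.get("group:2").last

puts lost_update.inspect
puts merged.inspect
```

Whether a check like this belongs in the store, in AD, or in the application is exactly the open question. The sketch only shows why plain last-write-wins is not enough for read-modify-write updates such as group membership or counters.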

I haven't had a chance to actually do more development on this. Right now I'm still hacking on Tahiti, but I'm pretty sure that it will use AD and Thrudb in some capacity. I'll post more details as these things develop.

January 11, 2008

ThruDB ORM for Ruby

Sebastian Delmont has posted his thoughts about putting together a ThruDB for Rails. If you're not familiar with ThruDB, you should go check it out. It's a document store written by Jake at Third Rail that is in the same vein as CouchDB, SimpleDB, or Google's BigTable (here's an intro to ThruDB).

Anyway, back to ThruDB for Rails. It would be kind of like an ActiveRecord-style abstraction for ThruDB. The truth is it really has nothing to do with Rails. It could be thread-safe, support connection pooling, and work with Merb. If done right it could be an abstraction that works with any document-style database combined with an index, and not just the ThruDB, Lucene, and Thrift combination.

After a brief discussion with Sebastian yesterday in #nyc.rb, I got to work to see if I could convert the ThruDB bookmark example to something more ruby-like. Here's what my bookmark example looks like using something I whipped up real quick called ThruMapper.

require 'ThruMapper'

class Bookmark < ThruMapper
  attribute :title, :string
  attribute :url, :string
  attribute :posted_date, :integer
  attribute :tags, :string
end

def load_tsv_file(file)
  # File.foreach closes the file once every line has been read
  File.foreach(file) do |line|
    # chomp returns a new string, so split its result directly
    bs = line.chomp.split("\t")
    Bookmark.create(:url => bs[0], :title => bs[1], :tags => bs[2])
  end

  # we'd need to do a commit after each one and possibly have a bulk insert
  # Commit just calls commitAll() on thrucene, which probably updates the index
  Bookmark.commit
end

# just a helper method to print it out
def print_bookmark(bookmark)
  puts "id    : #{bookmark.id}"
  puts "title : #{bookmark.title}"
  puts "url   : #{bookmark.url}"
  puts "tags  : #{bookmark.tags}"
end

# initialize the connection and load up some test data
ThruMapper.initialize_connection
load_tsv_file("../bookmarks.tsv")

# run a few find methods and check their results
Bookmark.find("tags:(+css +examples)",{ :random => 1}).each do |bookmark|
  print_bookmark(bookmark)
end

linux_bookmarks = Bookmark.find("title:(linux)",{:sortby => "title"}).each do |bookmark|
  print_bookmark(bookmark)
end

some_id = linux_bookmarks.first.id

# see the find_by_id method works
print_bookmark(Bookmark.find_by_id(some_id))

# see the find_all method works
puts Bookmark.find_all.size

# now remove them all from the store and clear the index
Bookmark.destroy_all

I reworked the BookmarkManager and didn't use the Thrift auto-generated bookmark files, but it works. ThruDB uses Thrift, which is a Facebook "framework for scalable cross-language services development". The annoying thing about Thrift is that you have to define a .thrift file for each object type, run a generator, and then work with the generated code. This is almost exactly like Google's protocol buffers, which were the bane of my existence there over the last summer.

I guess it makes sense to have the whole thing be cross language capable, but it drags dynamic languages down into compiled land. Why have schema files and run static generators when you can have the code be your schema? So following the syntax style Sebastian mentioned, ThruMapper constructs the necessary Thrift pieces for each inherited class.
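
As a rough illustration of "the code is the schema", here is a minimal sketch of an attribute class macro that records field names and types on the class itself. SchemaModel, schema, and to_h are hypothetical names chosen for this example. It is not the ThruMapper source, and it stops short of building any Thrift structures.

```ruby
# Hypothetical sketch; SchemaModel is NOT the real ThruMapper.
class SchemaModel
  # Each subclass keeps its own ordered list of [name, type] pairs.
  def self.schema
    @schema ||= []
  end

  # Class macro: records the field and defines a reader and writer for it.
  def self.attribute(name, type)
    schema << [name, type]
    attr_accessor name
  end

  def initialize(values = {})
    values.each { |name, value| public_send("#{name}=", value) }
  end

  # Plain-hash form of the record, ready to hand to whatever serializer is used.
  def to_h
    self.class.schema.each_with_object({}) do |(name, _type), hash|
      hash[name] = public_send(name)
    end
  end
end

class Bookmark < SchemaModel
  attribute :title, :string
  attribute :url, :string
  attribute :posted_date, :integer
  attribute :tags, :string
end

bookmark = Bookmark.new(:title => "Ruby", :url => "http://www.ruby-lang.org/")
p Bookmark.schema
p bookmark.to_h
```

Because the schema lives in a class-level instance variable, every subclass gets its own field list, and a serializer can walk that list at runtime instead of relying on generated files.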

So I've eliminated the .thrift files along with the generation step. The way I have it running now is that each ThruMapper class expects its own index. This creates a little bit of a configuration problem as the thrucene.conf file needs to be updated with each index. Does anyone have suggestions for dealing with this? I'd like to have the class definition of each ThruMapper class be the one point to get things going.
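
One way to keep the class definition as the single starting point is to let Ruby's inherited hook collect every mapped class, so the list of required indexes can be derived from code. The sketch below assumes hypothetical names (IndexedModel, REGISTRY, index_name) and only gathers index names. How those names would get into thrucene.conf is left open, since that file's format isn't covered here.

```ruby
# Hypothetical sketch; IndexedModel, REGISTRY and index_name are made-up names.
class IndexedModel
  REGISTRY = []

  # Ruby calls this hook each time a subclass is defined.
  def self.inherited(subclass)
    super
    REGISTRY << subclass
  end

  # Default index name derived from the class name, e.g. Feed -> "feed".
  def self.index_name
    name.downcase
  end
end

class Feed < IndexedModel; end
class Note < IndexedModel; end

index_names = IndexedModel::REGISTRY.map(&:index_name)
puts index_names.inspect
```

A startup task could walk this list to check that each index exists, or to generate the configuration, rather than editing it by hand for every new class.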

You can download the ThruMapper bookmark example and the ThruMapper code. Put them in tutorial/rb and run the ThruMapperBookmarkExample.rb to see it go.

Now I have to make it prettier, make it a gem, make it more generic so it can use CouchDB, implement some associations and automatic finders, make it thread safe, and have it pool connections. After all that it might be sweet. Hopefully Sebastian will agree to meet up with me next week to hack on this together. Anyone else up for a ThruDB hack session?

January 08, 2008

Released Basset Gem for Machine Learning

I've finally managed to release the initial version of Basset, which is a gem for performing machine learning tasks. At this point it's a bit of a stretch to say it's a general purpose machine learning gem. It consists of a single classifier (Naive Bayes), a feature selector based on chi squared, a generic document representation, and an evaluator for testing out different classification algorithms using cross validation. To install do:

gem install 'basset' --include-dependencies

Before I get to some statistics and code, I'd like to give Bryan Helmkamp a big thanks for helping out by moving the tests over to rspec and doing a little refactoring on the library.

I'm planning on improving Basset over the coming months. Most of my changes will be additional classification algorithms like decision trees, a perceptron, and a wrapper around a support vector machine library. I'll also be adding some clustering algorithms. As I put these in, I'll probably beef up the evaluator so I can run comparisons and tests. I also need to write more tests and get better coverage, which I hope to do as I work on the other stuff.

For the moment I just had the evaluator run my NaiveBayes with feature selection against Lucas Carlson's Classifier gem on rec.autos and rec.motorcycles from the 20newsgroups dataset. Here are the last few lines of output from Basset's ClassificationEvaluator:

Trained on 17860 documents on 10 cross validation runs.
External classified 1945 of 1980 correctly for 98.23% accurcy.
Executed run in 218.1 seconds.
Basset::NaiveBayes classified 1903 of 1980 correctly for 96.11% accurcy.
Executed run in 51.1 seconds.

As you can see the Classifier gem actually performs a little better on accuracy while Basset runs much faster. The lower accuracy from Basset is probably due to the lame default document representation, which is something that anyone seriously using the library should be overriding. I've provided an overridden document example in the library. Here's the test code I whipped up real quick to run the two against each other. It shows use of the ClassificationEvaluator.

require 'rubygems'
require 'basset'
require 'classifier'

def class_name(class_dir)
  class_dir.split(".").last
end

class1_dir = "rec.autos/"
class2_dir = "rec.motorcycles/"
class1_name = class_name(class1_dir)
class2_name = class_name(class2_dir)

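# Read every file in each class directory (capped at 10000) into a Basset::Document.
# Dir.entries order isn't guaranteed, so drop "." and ".." explicitly.
docs = [(Dir.entries(class1_dir) - %w[. ..]).first(10000).collect { |n|
  Basset::Document.new(File.readlines(class1_dir + n).join(" "), class1_name)
}]
docs << (Dir.entries(class2_dir) - %w[. ..]).first(10000).collect { |n|
  Basset::Document.new(File.readlines(class2_dir + n).join(" "), class2_name)
}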

evaluator = Basset::ClassificationEvaluator.new(docs)

# Here we're specifying the Basset classifiers to run against, the chi-value
# that is the cut-off in feature selection, and a block which the evaluator
# uses to run the external classifier.
evaluator.compare_against_basset_classifiers([Basset::NaiveBayes.new], 1.0) do |training_set, testing_set|
  classifier = Classifier::Bayes.new(class1_name, class2_name)
 
  puts "training Classifier::NaiveBayes on #{training_set.size} documents..."
  training_set.each do |document|
    classifier.train(document.classification, document.text)
  end
 
  puts "running Classifier::NaiveBayes on #{testing_set.size} documents..."
  number_correctly_classified = 0
  testing_set.each do |document|
    if classifier.classify(document.text).downcase == document.classification
      number_correctly_classified += 1
    end
  end
 
  number_correctly_classified
end
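
For readers wondering what the chi-value cut-off in the listing above means, here is a minimal sketch of the textbook chi-squared score for one term and one class, computed from a 2x2 table of document counts. The chi_squared helper and the counts are made up for illustration. Basset's own feature selector may compute or apply the statistic differently.

```ruby
# Textbook chi-squared statistic for one term and one class (an illustration,
# not Basset's implementation). Arguments are document counts:
#   a: in the class, contains the term    b: other classes, contains the term
#   c: in the class, lacks the term       d: other classes, lacks the term
def chi_squared(a, b, c, d)
  n = a + b + c + d
  denominator = (a + b) * (c + d) * (a + c) * (b + d)
  return 0.0 if denominator.zero?
  n * ((a * d) - (b * c))**2 / denominator.to_f
end

# Made-up counts: "engine" is concentrated in one class, "the" is everywhere.
counts = {
  "engine" => [40, 10, 60, 90],
  "the"    => [95, 94, 5, 6]
}

cutoff = 1.0
selected = counts.select { |_term, table| chi_squared(*table) > cutoff }.keys
puts selected.inspect
```

Terms that appear about equally often in every class score near zero and fall below the cut-off, so only terms that help tell the classes apart are kept as features.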
