September 04, 2008

Marshal data too short error with ActiveRecord

In my previous post about the speed of serializing data, I concluded that Marshal was the quickest way to get things done. So I set about using Marshal to store some data in an ActiveRecord object. Things worked great at first, but on some test data I got this error: marshal data too short. Luckily, Bryan Helmkamp had helpfully pointed out that there were sometimes problems with storing marshaled data in the database. He said it was best to base64 encode the marshal dump before storing.

I was curious why it was working on some things and not others. It turns out that some types of data being marshaled were causing the error to pop up. Here's the test data I used in my specs:

{ :foo => 3, :bar => 2 } # hash with symbols for keys and integer values
[3, 2.1, 4, 8]           # array with integer and float values

Everything worked when I switched the array values to all integers so it seems that floats were causing the problem. However, in the interest of keeping everything working regardless of data types, I base64 encoded before going into the database and decoded on the way out.
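For illustration, here's a minimal sketch of that encode-on-write, decode-on-read round trip. The helper names are made up for this example and the ActiveRecord side is left out, so treat it as an outline rather than the exact code from my app:

```ruby
require "base64"

# Serialize any Marshal-able object into plain ASCII text that is safe
# to keep in a text column. (Hypothetical helper name.)
def dump_for_text_column(object)
  Base64.encode64(Marshal.dump(object))
end

# Reverse of dump_for_text_column. (Hypothetical helper name.)
def load_from_text_column(text)
  Marshal.load(Base64.decode64(text))
end

samples = [
  { :foo => 3, :bar => 2 },  # hash with symbol keys and integer values
  [3, 2.1, 4, 8]             # array with integer and float values
]

samples.each do |sample|
  stored = dump_for_text_column(sample)
  restored = load_from_text_column(stored)
  puts "#{sample.inspect} round-trips: #{restored == sample}"
end
```

The encoded string contains only ASCII characters, so nothing between the application and the database has a reason to alter it.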

I also ran the benchmarks again to determine what impact this would have on speed. Here are the results for 100 iterations on a 10k element array and a 10k element hash with and without base64 encode/decode:

                         user       system     total       real
array marshal           0.200000   0.010000   0.210000 (  0.214018)
array marshal + Base64  0.220000   0.010000   0.230000 (  0.250260)

hash marshal            1.830000   0.040000   1.870000 (  1.892874)
hash marshal + Base64   2.040000   0.100000   2.140000 (  2.170405)

As you can see, the difference in speed is small: the Base64 round trip added roughly 15 to 17 percent to the wall-clock time in both cases. I assume that the error has to do with AR cleaning the stuff that gets inserted into the database, but I'm not really sure. In the end it's just easier to use Base64.encode64 when serializing data into a text field in ActiveRecord using Marshal.
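A comparison like this can be sketched with Ruby's Benchmark module. This is a reconstruction for illustration, not the script that produced the numbers in this post. The test data and method names are assumptions, and timings will differ from machine to machine:

```ruby
require "base64"
require "benchmark"

ITERATIONS = 100

# Assumed test data: a 10k element array mixing integers and floats,
# and a hash with 10k string keys.
array = Array.new(10_000) { |i| i.even? ? i : i + 0.5 }
hash = {}
10_000.times { |i| hash["key#{i}"] = i }

# Plain Marshal dump and load.
def marshal_round_trip(object)
  Marshal.load(Marshal.dump(object))
end

# Marshal dump and load with a Base64 encode/decode step in between.
def base64_round_trip(object)
  Marshal.load(Base64.decode64(Base64.encode64(Marshal.dump(object))))
end

Benchmark.bm(24) do |bm|
  bm.report("array marshal")          { ITERATIONS.times { marshal_round_trip(array) } }
  bm.report("array marshal + Base64") { ITERATIONS.times { base64_round_trip(array) } }
  bm.report("hash marshal")           { ITERATIONS.times { marshal_round_trip(hash) } }
  bm.report("hash marshal + Base64")  { ITERATIONS.times { base64_round_trip(hash) } }
end
```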

I've also read people posting about this error when using the database session store. I can only assume that it's because they were trying to store either way too much data in their session (too much for a regular text field) or they were storing float values or some other data type that would cause this to pop up. Hopefully this helps.

August 27, 2008

Serializing data speed comparison: Marshal vs. JSON vs. Eval vs. YAML

Last night at the NYC Ruby hackfest, I got into a discussion about serializing data. Brian mentioned the Marshal library to me, which for some reason had completely escaped my attention until last night. He said it was wicked fast so we decided to run a quick benchmark comparison.

The test data is designed to roughly approximate what my stored classifier data will look like. The different methods we decided to benchmark were Marshal, json, eval, and yaml. With each one we took the in-memory object and serialized it and then read it back in. With eval we had to convert the object to ruby code to serialize it then run eval against that. Here are the results for 100 iterations on a 10k element array and a hash with 10k key/value pairs run on my Macbook Pro 2.4 GHz Core 2 Duo:

                 user      system     total       real
array marshal  0.210000   0.010000   0.220000 (  0.220701)
array json     2.180000   0.050000   2.230000 (  2.288489)
array eval     2.090000   0.060000   2.150000 (  2.240443)
array yaml    26.650000   0.350000  27.000000 ( 27.810609)

hash marshal   2.000000   0.050000   2.050000 (  2.114950)
hash json      3.700000   0.060000   3.760000 (  3.881716)
hash eval      5.370000   0.140000   5.510000 (  6.117947)
hash yaml     68.220000   0.870000  69.090000 ( 72.370784)

The order in which I tested them is pretty much the order in which they ranked for speed. Marshal was amazingly fast. JSON and eval came out roughly equal on the array, with eval trailing quite a bit for the hash. YAML was just slow as all hell. A note on the JSON: I used the 1.1.3 library, which uses C to parse. I assume it would be quite a bit slower if I used the pure Ruby implementation. Here's a gist of the benchmark code if you're curious and want to run it yourself.
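For readers without the gist handy, here's a rough sketch of how a four-way comparison like this can be set up. It is not the original benchmark code. The SERIALIZERS table, the round_trip helper, and the small test data are all assumptions, and the sizes are kept small so it finishes quickly:

```ruby
require "benchmark"
require "json"
require "yaml"

# Each serializer is a pair of lambdas: object -> string, string -> object.
SERIALIZERS = {
  "marshal" => [->(o) { Marshal.dump(o) },  ->(s) { Marshal.load(s) }],
  "json"    => [->(o) { JSON.generate(o) }, ->(s) { JSON.parse(s) }],
  "eval"    => [->(o) { o.inspect },        ->(s) { eval(s) }],
  "yaml"    => [->(o) { YAML.dump(o) },     ->(s) { YAML.load(s) }]
}

# Serialize the object and read it straight back in.
def round_trip(name, object)
  dump, load = SERIALIZERS.fetch(name)
  load.call(dump.call(object))
end

iterations = 10
array = (1..1_000).to_a
hash = {}
1_000.times { |i| hash["key#{i}"] = i }

Benchmark.bm(14) do |bm|
  SERIALIZERS.each_key do |name|
    bm.report("array #{name}") { iterations.times { round_trip(name, array) } }
  end
  SERIALIZERS.each_key do |name|
    bm.report("hash #{name}") { iterations.times { round_trip(name, hash) } }
  end
end
```

String keys are used in the hash because JSON has no notion of Ruby symbols, so a hash with symbol keys would not come back equal to the original.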

If you're serializing user data, be super careful about using eval. It's probably best to avoid it completely. Finally, just for fun I took yaml out (it was too slow) and ran the benchmark again with 1k iterations:

                 user      system     total       real
array marshal  2.080000   0.110000   2.190000 (  2.242235)
array json    21.860000   0.500000  22.360000 ( 23.052403)
array eval    20.730000   0.570000  21.300000 ( 21.992454)

hash marshal  19.510000   0.500000  20.010000 ( 20.794111)
hash json     39.770000   0.670000  40.440000 ( 41.689297)
hash eval     51.410000   1.290000  52.700000 ( 54.155711)

August 07, 2008

Storing many classification models

One of the things I need to do in Filterly is keep many trained classifiers. These are the machine learning models that determine if a blog post is on topic (Filterly separates information by topic). At the very least I need one per topic in the system. If I want to do something like boosting then I need even more. The issue I'm wrestling with is how to store this data. I'll outline a specific approach and what the storage needs are.

Let's say I go with boosting and 10 perceptrons. I'll also limit my feature space to the 10,000 most statistically significant features. So the storage for each perceptron is a 10k element array. However, I'll also have to keep another data structure to store what the 10k features are and their position in the array. In code I use a hash for this where the feature name is the key and the value is its position. I just need to store one of these hashes per topic.
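To make the layout concrete, here's a tiny sketch with four features instead of 10,000. The feature names, the weights, and the method names are invented, features are treated as simply present or absent, and the vote is an unweighted majority rather than real boosting:

```ruby
# Maps each feature name to its position in every weight array.
# One of these per topic.
feature_index = { "ruby" => 0, "rails" => 1, "marshal" => 2, "yaml" => 3 }

# One weight array per perceptron, each the same length as feature_index.
perceptrons = [
  [0.5, 0.25, -0.5, 0.0],
  [1.0, -0.25, 0.25, -1.0]
]

# Sum the weights of the features present in a post. Features that are
# not in the index are ignored.
def score(weights, feature_index, features)
  features.inject(0.0) do |sum, feature|
    position = feature_index[feature]
    position ? sum + weights[position] : sum
  end
end

# A perceptron votes "on topic" when its score is positive.
def on_topic?(weights, feature_index, features)
  score(weights, feature_index, features) > 0
end

# Unweighted majority vote across the perceptrons. Real boosting would
# weight each vote; that is left out to keep the sketch short.
def majority_on_topic?(perceptrons, feature_index, features)
  votes = perceptrons.count { |weights| on_topic?(weights, feature_index, features) }
  votes * 2 > perceptrons.size
end

post_features = ["ruby", "rails", "python"]
puts majority_on_topic?(perceptrons, feature_index, post_features)
```

Because every perceptron for a topic shares the same feature hash, adding another perceptron only costs one more weight array.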

That's not really a huge amount of data. I'm more concerned about the best way to store it. I don't think this kind of thing maps well to a relational database. I don't need to store the features individually. Generally when I'm running the thing I'll want the whole perceptron and feature set in memory for quick access. For now I'm just using a big text field and serializing each using JSON.
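As a rough idea of what that looks like, here's a sketch of turning one topic's model into JSON text and back. The hash layout and the method names are assumptions for the example, and the database column itself is not shown:

```ruby
require "json"

# Hypothetical layout: one feature hash plus one weight array per perceptron.
model = {
  "feature_index" => { "ruby" => 0, "rails" => 1 },
  "perceptrons"   => [[0.5, -0.25], [1.0, 0.75]]
}

# Text that would go into the big text field.
def serialize_model(model)
  JSON.generate(model)
end

# What comes back out when the classifier is loaded.
def load_model(text)
  JSON.parse(text)
end

text = serialize_model(model)
puts load_model(text) == model
```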

I don't really like this approach. The whole serializing into the database thing seems really inelegant, and on top of that there's the time it takes to parse these things. Each time I want to see if a new post is on topic I'd need to load up the classifier and parse the ten 10k arrays and the 10k key hash. I could keep each classifier running as a service, but then I've got a pretty heavy process running for each topic.

I guess I'll just use the stupid easy solution for the time being and worry about performance later. Anyone have thoughts on the best approach?

July 08, 2008

Living on the cheap with EC2

Project Tahiti (recently dubbed Filterly) is a bit of a resource whore. Actually, it's worse than that. There are a few things that make it more resource heavy than a typical web application. My needs from a really high view are something like this:

     
  • scalable data store
  • web application layer
  • crawling and feed parsing workers
  • machine learning workers

A typical web application usually only has to worry about the first two items (and probably some basic background processing). On some apps the data store doesn't even need to be that scalable as long as you're using caching intelligently. With Filterly, the background workers can get quite expensive in terms of machine resources and stress on the data store.

In an ideal world I'd have a Google cluster handy, but I'm an impoverished college student so I have to make do with limited resources. Very limited. Here's my strategy for using EC2 to minimize my costs. Currently, my configuration looks something like this:

     
  • 1 instance running a customized EC2 on Rails which includes database, web application, a beanstalkd queue, and a custom daemon written by me.

Total base cost: $73 per month plus a few bucks for some S3 backups. For now I'm just using the base EC2onRails backup strategy, which is a full backup every night with incremental backups every 10 minutes all going to S3. I'm planning on moving the database to flexible storage once it's out to the general public, but for now this will do. I'm not storing banking transactions.

The cool and thrifty part is the daemon. Every two hours it loads up the beanstalkd queue with jobs and starts up a custom EC2 instance based on a basic Ubuntu AMI. The instance then downloads the latest Filterly code from S3 and runs a script. That script starts up a bunch of worker processes that pull off the beanstalkd queue. The daemon monitors the queue until it's empty then shuts down the EC2 instance. I could have used SQS, but I like beanstalkd more. It's quicker than SQS, and is still feature rich with things like priorities, delayed jobs, and reservations. Most importantly: it's free since it's running on an instance I'm already paying for.
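The control flow of one batch can be sketched like this. It is not the actual daemon. BatchRunner and the two stand-in classes are hypothetical, and the queue and launcher are plain Ruby objects so the example runs without beanstalkd or EC2. A real version would need a launcher that talks to the EC2 API and a queue check against beanstalkd:

```ruby
# Coordinates one batch: fill the queue, start a worker instance, wait
# for the queue to drain, then shut the instance down.
class BatchRunner
  def initialize(queue, launcher, poll_seconds = 30)
    @queue = queue
    @launcher = launcher
    @poll_seconds = poll_seconds
  end

  def run(jobs)
    jobs.each { |job| @queue.push(job) }
    instance = @launcher.start_instance
    sleep(@poll_seconds) until @queue.empty?
  ensure
    # Stop paying for the instance even if monitoring fails part way.
    @launcher.stop_instance(instance) if instance
  end
end

# Stand-in queue: pretends a worker removes one job per poll.
class DrainingQueue
  def initialize
    @jobs = []
  end

  def push(job)
    @jobs << job
  end

  def empty?
    @jobs.shift
    @jobs.empty?
  end
end

# Stand-in launcher: records calls instead of talking to EC2.
class RecordingLauncher
  attr_reader :events

  def initialize
    @events = []
  end

  def start_instance
    @events << :started
    "instance-1"
  end

  def stop_instance(instance)
    @events << [:stopped, instance]
  end
end

launcher = RecordingLauncher.new
BatchRunner.new(DrainingQueue.new, launcher, 0).run(%w[feed-a feed-b feed-c])
p launcher.events
```

Putting the shutdown in an ensure block matters for the budget: an instance that is left running after a failed batch keeps billing by the hour.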

I have this run about six times a day so it adds 60 cents a day or roughly $18 per month to my total cost. This is perfect for my current stage of development (before private beta). Once I launch into private beta, I'll probably keep a worker instance running constantly to update the most trafficked topics and feeds.

The real value comes when I scale up the private beta and the index gets larger. It's a simple change in the daemon to make it launch 10 instances for crawling, updating, and machine learning just for an hour six times a day. So all of the topics in Filterly will get six updates per day for around $180 per month. Obviously, I'd like them updating constantly, but I think this is enough to keep it interesting and it's still affordable. The 10 instance number is completely arbitrary at this point since I'm not far enough along to be sure about how many resources these things will consume, but the idea is the same whether it's 2 or 200 instances.
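The arithmetic behind those numbers, spelled out. The 10 cents per instance-hour rate is an assumption inferred from the 60 cents a day figure for six one-hour runs, and a month is rounded to 30 days:

```ruby
# All amounts are in cents to avoid floating point rounding.
HOURLY_RATE_CENTS = 10   # assumed: $0.10 per instance-hour
RUNS_PER_DAY      = 6
DAYS_PER_MONTH    = 30

# Monthly cost of running the given number of worker instances for
# hours_per_run hours, RUNS_PER_DAY times a day.
def monthly_cost_cents(instances, hours_per_run = 1)
  instances * hours_per_run * RUNS_PER_DAY * HOURLY_RATE_CENTS * DAYS_PER_MONTH
end

def dollars(cents)
  format("$%.2f", cents / 100.0)
end

puts dollars(monthly_cost_cents(1))    # => $18.00
puts dollars(monthly_cost_cents(10))   # => $180.00
```

A run that spills past the hour would be billed as two hours, so hours_per_run is the number to watch as the index grows.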

Finally, this model hasn't addressed the scalable data store problem. I'll probably move over to HBase, CouchDB, or ThruDB later on, but in prototype mode I just need to get something up quickly.

July 02, 2008

Tahiti gets a name and an initial code push

I haven't given a real update on Tahiti since my post last September announcing that I was putting it on hold. That's mostly been true since then. My schedule has prevented me from finding time to work on the project. In that time I've worked as a consultant with two different companies, completed multiple projects, presented at two conferences, attended two other conferences, and completed two full time semesters at Columbia.

However, I've still been thinking about the project and trying to devise a plan to get a prototype out. In January I decided to team up with Mint Digital to help out with design work while I did some Ruby consulting for them. I worked with their designers and put together some wire frames to solidify my ideas on the project and even came up with a proper name: Filterly.

My hope was to work on it during the spring semester, but commitments in and out of school kept me from finding time to code. Well now that it's summer and I'm not in school, I've been sneaking in some time here and there while not working at Mint. I've written some of the back end code for crawling and updating feeds, which was the source of my recent posts about background processing and message queues.

I finally pushed out some code to EC2 and it should be running full time from here on. Right now it's just back end stuff and a landing page, but it's a start. If you're curious about what the hell the goal of the project is, please visit the landing page at Filterly.com where I lay out the basic idea.

I'm also posting this so that if people find the Filterly.com crawler in their server logs, this will have something more than just a note that I'm not working on it any more. Ok, enough of this pointless blather. I promise my next post will be more interesting. Probably about my EC2 setup and my plan to spend as little as possible on hosting costs.

June 19, 2008

Trouble With Integrated Background Rails Libraries

Originally this post was going to be about how I was using Adam Pisoni's ruby gem named Skynet to do distributed background processing. However, this is not a story of success with that library.

The concept of Skynet is really cool: a dead simple MapReduce implementation with ActiveRecord integration. Installing it was a breeze and getting it to work on my development system was quick and painless. Getting it running in production was an entirely different story. First I ran into problems with the logging. Skynet logs information to its own log files so I had to make sure the permissions on that were set up correctly. Then when I tried to get it running on another EC2 instance it just wouldn't work. No errors, no information, no nothing. So I decided to jump into the code and see if I could figure things out. A few hours later after looking through quite a bit of code I was no closer to knowing why it was failing.

I then realized something. I'm not trying to do something horribly complex. I don't need the map reduce paradigm. I just needed a queue and a way to process jobs from it. What's more, I didn't even need a disk based persistent queue. I think this is a failing of a bunch of these fully integrated background processing libraries. They're tied to drb, or a specific type of queue, or the workers aren't customizable, or the workers have undesirable behavior (like BJ spawning a full Rails instance for every job execution). The problem is that I wanted a few basics like processing a method on an AR object asynchronously (something all of these do to varying degrees of success), but I also wanted some very custom behavior for how workers handle web crawling jobs, which is a task that I have to do a bit of.

I ended up going with Beanstalkd and the beanstalk-client gem (for a good intro to beanstalk read Geoffrey's post). I decided not to use the Rails integrated async-observer because the worker it provides didn't work for my needs. Instead, I wrote a basic wrapper for the queue, a generic job wrapper, and a worker. The code for the worker is something I really like because it's crazy simple. Have a look:

require "#{File.dirname(__FILE__)}/../config/environment.rb"

class QueueWorker
  def initialize(logger, queue)
    @queue = queue
    @logger = logger
  end
 
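  # Pass false for continue_loop to process a single job and return.
  def start(continue_loop = true)
    begin
      begin
        job = @queue.reserve            # take the next job off the queue
        job.process                     # run it
        @queue.mark_as_finished(job)    # tell the queue it is done
      end while continue_loop
    rescue Exception => e
      # Any error ends the loop; it is logged rather than re-raised.
      @logger.error "Queue worker encountered an error: #{e.message}"
    end
  end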
end

QueueWorker.new(RAILS_DEFAULT_LOGGER, QueueWrapper.new(ARGV.first)).start

I like this worker because it doesn't know anything about the job or the queue. As long as the queue responds to 'reserve' and 'mark_as_finished', it can be anything. Likewise, the thing that is returned from reserve only needs to respond to 'process'. The logger object only needs to respond to 'error'. The worker is really stupid. It just grabs things off the queue and processes them and logs an error if execution stops. I could easily substitute the RAILS_DEFAULT_LOGGER in initialization with a wrapper to a monitoring service that will alert me on failure.
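To show how little the worker asks of its collaborators, here's a sketch that runs the same loop against plain in-memory objects. The worker is repeated without the Rails environment require so the example stands alone. InMemoryQueue, RecordingJob, and ListLogger are made-up stand-ins, not part of my code base:

```ruby
# Same loop as the worker above, minus the Rails environment require.
class QueueWorker
  def initialize(logger, queue)
    @queue = queue
    @logger = logger
  end

  def start(continue_loop = true)
    begin
      begin
        job = @queue.reserve
        job.process
        @queue.mark_as_finished(job)
      end while continue_loop
    rescue Exception => e
      @logger.error "Queue worker encountered an error: #{e.message}"
    end
  end
end

# Anything with reserve and mark_as_finished can act as the queue.
class InMemoryQueue
  attr_reader :finished

  def initialize(jobs)
    @jobs = jobs.dup
    @finished = []
  end

  def reserve
    @jobs.shift or raise "queue is empty"
  end

  def mark_as_finished(job)
    @finished << job
  end
end

# Anything with process can act as a job.
class RecordingJob
  attr_reader :processed

  def initialize
    @processed = false
  end

  def process
    @processed = true
  end
end

# Anything with error can act as the logger.
class ListLogger
  attr_reader :messages

  def initialize
    @messages = []
  end

  def error(message)
    @messages << message
  end
end

jobs = [RecordingJob.new, RecordingJob.new]
queue = InMemoryQueue.new(jobs)
logger = ListLogger.new

# Runs until the stand-in queue raises because it has nothing left.
QueueWorker.new(logger, queue).start
puts jobs.all? { |job| job.processed }
puts logger.messages.inspect
```

The stand-in queue raises when it runs dry, which is what ends the loop here. A real queue would more likely block in reserve until the next job arrives.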

The QueueWrapper and another class called JobWrapper are the only points of interface between generic regular ruby code and the queueing mechanism. It's their responsibility to make sure that reserve and mark_as_finished methods exist. The JobWrapper also comes into play when pushing jobs onto the queue and marshaling them from the queue.
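I haven't shown those two classes, so here's a rough sketch of the shape they could take. This is an illustration, not my actual code. It takes a connection object instead of an address string, and the connection's method names (put, reserve, body, delete) are assumptions modeled loosely on a beanstalk client, so check them against the real gem before relying on them:

```ruby
# Pairs a raw queue job with the Ruby object marshaled into its body.
class JobWrapper
  attr_reader :raw_job

  def self.serialize(job)
    Marshal.dump(job)
  end

  def initialize(raw_job)
    @raw_job = raw_job
    @payload = Marshal.load(raw_job.body)
  end

  def process
    @payload.process
  end
end

# The only class that knows how to talk to the underlying queue.
class QueueWrapper
  def initialize(connection)
    @connection = connection
  end

  def push(job)
    @connection.put(JobWrapper.serialize(job))
  end

  def reserve
    JobWrapper.new(@connection.reserve)
  end

  def mark_as_finished(job)
    job.raw_job.delete
  end
end

# In-memory stand-ins so the sketch runs without a queue server.
class FakeRawJob
  attr_reader :body

  def initialize(body, store)
    @body = body
    @store = store
  end

  def delete
    @store.delete(self)
  end
end

class FakeConnection
  attr_reader :jobs

  def initialize
    @jobs = []
  end

  def put(body)
    @jobs << FakeRawJob.new(body, @jobs)
  end

  def reserve
    @jobs.first or raise "no jobs"
  end
end

PROCESSED = []

class EchoJob
  def initialize(value)
    @value = value
  end

  def process
    PROCESSED << @value
  end
end

connection = FakeConnection.new
queue = QueueWrapper.new(connection)
queue.push(EchoJob.new("hello"))

job = queue.reserve
job.process
queue.mark_as_finished(job)
puts PROCESSED.inspect
puts connection.jobs.empty?
```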

The real magic happens in the job definitions. With this structure I can define any kind of job I want. Here's the important snippet of some code for a generic AR async method call:

  def process
    model_name.constantize.find_by_id(id).send(method_to_send)
  end
 
  def self.queue_job(queue, model_instance, method)
    queue.push(GenericJob.new(model_instance.class.to_s, model_instance.id, method))
  end
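Filled out into something runnable outside Rails, the job could look roughly like the sketch below. The initializer and readers are my guess at the parts the snippet leaves out. Object.const_get stands in for ActiveSupport's constantize, and Feed is a fake model with just enough behavior to run the example:

```ruby
class GenericJob
  attr_reader :model_name, :id, :method_to_send

  # Stores the class name and id rather than the object itself, so the
  # record is looked up fresh when the job runs.
  def initialize(model_name, id, method_to_send)
    @model_name = model_name
    @id = id
    @method_to_send = method_to_send
  end

  def process
    # Object.const_get stands in for constantize in this Rails-free sketch.
    Object.const_get(model_name).find_by_id(id).send(method_to_send)
  end

  def self.queue_job(queue, model_instance, method)
    queue.push(GenericJob.new(model_instance.class.to_s, model_instance.id, method))
  end
end

# Fake model: keeps records in a hash instead of a database.
class Feed
  RECORDS = {}

  attr_reader :id, :updated

  def initialize(id)
    @id = id
    @updated = false
    RECORDS[id] = self
  end

  def self.find_by_id(id)
    RECORDS[id]
  end

  def update_from_source
    @updated = true
  end
end

queue = []   # Array#push is enough to stand in for the queue here
feed = Feed.new(42)
GenericJob.queue_job(queue, feed, :update_from_source)
queue.shift.process
puts feed.updated
```

One thing to watch: find_by_id returns nil when the record has been deleted in the meantime, so a job like this raises a NoMethodError in that case unless process checks for nil first.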

And the active record integration looks like this:

class ActiveRecord::Base
  def send_later(method)
    GenericJob.queue_job(ActiveRecord::Base.app_background_queue, self, method)
  end
 
  def self.app_background_queue
    @@app_background_queue ||= QueueWrapper.new("#{BEANSTALK_HOST}:#{BEANSTALK_PORT}")
  end
end

So I built async processing for AR objects in less than a few hundred lines of code. The cool thing with this approach is that I can now define new jobs for crawling that have different process behavior. I actually queue up full batches of things instead of a single AR object. I can also define specific processing behavior like using multiple threads to deal with slow network IO. I like it because the worker that runs this job doesn't ever have to know about all this. It's completely agnostic to the types of jobs that it processes.
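As an example of a job with its own processing behavior, here's a sketch of a batch job that spreads slow fetches across a few threads. BatchFetchJob is hypothetical. The fetcher is passed in as a lambda only so the example runs without a network; a job that actually gets marshaled onto a queue could not carry a lambda and would call its HTTP client directly instead:

```ruby
require "thread"

class BatchFetchJob
  def initialize(urls, fetcher, thread_count = 4)
    @urls = urls
    @fetcher = fetcher
    @thread_count = thread_count
  end

  # Fetches every url using a small pool of threads and returns a hash
  # of url => result. The worker only ever calls process.
  def process
    pending = Queue.new
    @urls.each { |url| pending << url }
    results = Queue.new

    threads = Array.new(@thread_count) do
      Thread.new do
        loop do
          begin
            url = pending.pop(true)   # non-blocking; raises ThreadError when empty
          rescue ThreadError
            break
          end
          results << [url, @fetcher.call(url)]
        end
      end
    end
    threads.each { |thread| thread.join }

    collected = {}
    results.size.times do
      url, body = results.pop
      collected[url] = body
    end
    collected
  end
end

# Stand-in for a slow network call.
slow_fetcher = lambda do |url|
  sleep 0.01
  "contents of #{url}"
end

job = BatchFetchJob.new(%w[http://a.example/ http://b.example/], slow_fetcher)
puts job.process.inspect
```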

Wow, this ended up going on for quite a while. I guess the lesson learned for me is that sometimes it's quicker to just pick up a few basic building blocks and wire them together.

Note: In fairness to Adam, he's busy working on a new version of Skynet that will offer other queuing options. It's definitely a project worth keeping an eye on.

June 12, 2008

Mocking Web Requests in Ruby with FakeWeb

I've been using rFeedParser lately to do a little feed parsing. One of the problems I ran into was trying to test the thing. Since there's network IO involved, I obviously want to get some mock action. In my first attempt I ended up having to modify rFeedParser's call to open to explicitly call OpenURI.open so that I could mock it out with rspec like this:

OpenURI.should_receive(:open_uri).at_least(:once).with(feed_url, {
  "Accept-encoding"=>"gzip, deflate",
  "User-Agent"=>"Filterly +http://www.filterly.com",
  "Accept"=>"application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1",
  "A-IM"=>"feed"
  }).and_return(*xml_mocks)

xml_mocks is an array of strings that hold the xml I want to return on my sequence of calls. As you can see it's ugly as hell and not that fun to use. Last night at the New York City Ruby meeting (known as nyc.rb) I got a little pointer from Bryan Helmkamp to a library called FakeWeb. I have no clue how this escaped my notice until now, but it makes things much cleaner and easier. FakeWeb is a library written by Blaine Cook for faking web requests. Grab it like so:

sudo gem install FakeWeb

Then have fun with it like so:

require 'fake_web'
require 'open-uri'
# from the contents of a file:
FakeWeb.register_uri("http://www.pauldix.net/", :file => "some_file.xml")
# or from the contents of a string:
FakeWeb.register_uri("http://www.pauldix.net/", :string => "foo")
# and boom:
open("http://www.pauldix.net").read # =>  "foo"

There's only one gotcha to look out for. When you call FakeWeb.register_uri, make sure that the uri you pass in has a slash at the end. Your call to open can include the slash or not, but if you don't register the uri with the slash, the faking won't actually happen.
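One way to sidestep the gotcha is to normalize the address before registering it. Here's a small sketch using only Ruby's standard URI library. The with_root_slash helper is made up for this example and only handles the bare-host case. It doesn't call FakeWeb itself, so its result would be passed to register_uri as in the snippet above:

```ruby
require "uri"

# Adds the root slash to a bare host url and leaves other urls alone.
def with_root_slash(url)
  uri = URI.parse(url)
  uri.path = "/" if uri.path.empty?
  uri.to_s
end

puts with_root_slash("http://www.pauldix.net")           # => http://www.pauldix.net/
puts with_root_slash("http://www.pauldix.net/")          # => http://www.pauldix.net/
puts with_root_slash("http://www.pauldix.net/feed.xml")  # => http://www.pauldix.net/feed.xml
```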

September 26, 2007

Tahiti on Hold

Here is a quick update in case anyone is following my progress on Tahiti. I've put the project on hold for a few months. I had to pick up a little consulting work to bring in some extra cash. That in addition to a full load at school has me completely swamped. The up side is that I'm working with the guys at EastMedia again.

One of my coworkers, Bryan Helmkamp, has an interesting blog if you're looking for a little Ruby reading material.

August 28, 2007

Tahiti Update: Didn't Make The TechCrunch20

This space has been quiet for a while so here's a quick update to let people know I'm still around. I heard back that project Tahiti didn't make the TechCrunch20. This was actually a good thing since the last month has been so busy with other stuff. Now I can move back to my original development plan which is to release to a small topic focused group and expand as I work out the kinks.

I wrapped up everything in Silicon Valley and now I'm back in the Alley. I'm ready to continue work on the project and will hopefully launch a private beta for people in the Ruby community sometime in September. It will be tough to keep it going while I'm starting a new semester, but hopefully I'll be able to pull it off.

I'm working out of CooperBricolage today and going to the nyc.rb hackfest later tonight.

July 20, 2007

Server Setup is The Suck

I'm trying to get my application set up and deployed on EC2. It's not going well. I'm using Paul Dowman's AMI, but of course it's giving me problems. Fix one problem then on to the next, then the next, then the next. Maybe later I'll restart the whole thing and take notes on the failures. Now if you'll excuse me, I have to find a small animal to kick.

Update: I finally got it up. After more sweat and toil I found that it was probably just something really stupid on my part. Tomorrow night I'll have to make another image for the workers. Thanks for making the image Paul, you rock!
