July 02, 2008

Tahiti gets a name and an initial code push

I haven't given a real update on Tahiti since my post last September announcing that I was putting it on hold. That's mostly been true since then. My schedule has prevented me from finding time to work on the project. In that time I've worked as a consultant with two different companies, completed multiple projects, presented at two conferences, attended two other conferences, and completed two full time semesters at Columbia.

However, I've still been thinking about the project and trying to devise a plan to get a prototype out. In January I decided to team up with Mint Digital to help out with design work while I did some Ruby consulting for them. I worked with their designers and put together some wireframes to solidify my ideas on the project and even came up with a proper name: Filterly.

My hope was to work on it during the spring semester, but commitments in and out of school kept me from finding time to code. Well, now that it's summer and I'm not in school, I've been sneaking in some time here and there while not working at Mint. I've written some of the back end code for crawling and updating feeds, which was the source of my recent posts about background processing and message queues.

I finally pushed out some code to EC2 and it should be running full time from here on. Right now it's just back end stuff and a landing page, but it's a start. If you're curious about what the hell the goal of the project is, please visit the landing page at Filterly.com where I lay out the basic idea.

I'm also posting this so that if people find the Filterly.com crawler in their server logs, this will have something more than just a note that I'm not working on it any more. Ok, enough of this pointless blather. I promise my next post will be more interesting. Probably about my EC2 setup and my plan to spend as little as possible on hosting costs.

June 30, 2008

Digg's New Recommendation Engine has a Classic Problem

Mashable recently posted about an upcoming feature addition to Digg: a recommendation engine. After reading the post and watching Kevin's screen cast on the new feature, I think it looks interesting. Improving the signal to noise ratio on Digg is definitely a good idea and I think a recommendation engine is a step in the right direction. However, the system as described by Kevin has a big hole in it: it's worthless to people with no digging history or friends.

This is a typical problem in recommendation systems. How do you make the system useful to users for whom you have no data? If you look at usage patterns for users on UGC (user generated content) websites, this isn't a small problem. Numbers for web sites I've worked on have < 5% of users submitting content, and < 10% of users voting on content. I have no idea what Digg's numbers are, but I'm sure they're not too far off. This means the new recommendation system is completely useless for around 90% of the people that use Digg.

The news isn't all bad though. It's a short stretch to implement recommendation systems that are based on item similarity. This way a user that has no history on Digg could come to a news item and click a link that says "find me more on Digg like this." The methods aren't that much different than the user similarity measures. In fact, the users who vote to help create their own recommendations are also giving the system the data it needs to calculate these item recommendations.
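
To make the item similarity idea a bit more concrete, here's a minimal sketch. It is not how Digg does it (I have no idea what they use); it just scores two stories by the overlap in who voted for them, using Jaccard similarity. The vote data, story ids, and method names are all made up for illustration.

```ruby
require 'set'

# Hypothetical vote data: story id => set of user ids who dugg it.
VOTES = {
  "story_a" => Set["ann", "bob", "cat"],
  "story_b" => Set["ann", "bob"],
  "story_c" => Set["dan"]
}

# Jaccard similarity: shared voters divided by total distinct voters.
def jaccard(a, b)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end

# Rank every other story by how similar its voters are to this story's voters.
def similar_items(votes, item, limit = 5)
  scored = votes.reject { |other, _| other == item }
                .map { |other, voters| [other, jaccard(votes[item], voters)] }
  scored.sort_by { |_, score| -score }.first(limit)
end

RESULT = similar_items(VOTES, "story_a")
```

The visitor asking for "more like this" needs no history of their own here; only the story's existing votes are used.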

Another method is to collect passive data like a user clicking on a news link on Digg. They're not actively voting the article up or down, but there is an implicit indication of interest in the article. You can then use this data to make a guess about what the user may like. So this method requires a little bit of data, but doesn't require the user to be an "active digger."
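
A rough sketch of the passive data idea, under some loud assumptions: the weights are invented, and I'm pretending each story carries a topic label, which is my own simplification. A click counts as weaker evidence of interest than an explicit vote, and the totals become a crude interest profile.

```ruby
# Hypothetical weights: a click is weaker evidence than an explicit digg.
EVENT_WEIGHTS = { :click => 1.0, :digg => 3.0 }

# events: array of [topic, event_type] pairs observed for one user.
# Returns a hash of topic => accumulated interest score.
def interest_profile(events, weights = EVENT_WEIGHTS)
  events.each_with_object(Hash.new(0.0)) do |(topic, type), profile|
    profile[topic] += weights.fetch(type, 0.0)
  end
end

# Topics ordered from most to least interesting.
def top_topics(profile, limit = 3)
  profile.sort_by { |_, score| -score }.first(limit).map(&:first)
end

EVENTS = [["ruby", :click], ["ruby", :click], ["politics", :click], ["ruby", :digg]]
PROFILE = interest_profile(EVENTS)
```

A user who only ever clicks still ends up with a non-empty profile, which is the whole point.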

In any case, I think it will be interesting to see how they continue to develop the system. The information filtering space is one that will only become more important as the exponential growth of content continues.

June 25, 2008

How I Got Started Programming

Giles Bowkett tagged me to write these answers up. Read his first since it's probably more interesting.

How old were you when you started programming?
On my 9th birthday my grandmother gave me a Commodore 64C. At that point I had no idea what a computer was or what they were for. I had played with my brother on an Atari 2600 so I figured it was for gaming. Luckily, it came with some big manuals about setting it up, using it, and even a basic programming book. My dad helped me get started and after that I started exploring. I remember the whole thing seemed magical. What could it do? What could I make it do? I had the feeling that it opened up an entire world of unknown possibility.

How did you get started programming?
Like Giles, I had a long break after that initial rush. I played around with BASIC on the 64C until around the time I was 11 when I became more interested in skateboarding and skiing. I picked it up again for fun when I was 18, but only managed to teach myself enough C to get frustrated. Since I didn't have money to go to college, my immediate concern was trying to eke out a living. So I got into PC Tech work and worked my way up the ranks.

I moved into some Windows NT 4.0 administration and even picked up the now mocked and reviled MCSE (Microsoft Certified Systems Engineer). However, it enabled me to go to Redmond and work as a tester on the Windows 2000 team, which is where I got my first real exposure to programming. I wrote VBScript to automate the loading of test environments. I studied with other testers reading books on Pascal, C, and C++. However, I wouldn't call this a programming job. I was merely trying to expand into that area while doing more traditional Q.A.

After Windows 2000 shipped I went to work as a consultant migrating NT 4.0 installations. I shelved programming again for a few years with the hope that I could eventually move back into it. Yet, I moved further away from programming by picking up work as a network engineer playing with Cisco gear and frame relay connections. The move into network engineering is what eventually led me back to programming for good.

One night I had to perform an upgrade of two routers critical to a cable company's business. The pay-per-view traffic for a 100 site cable network came through these two routers. I had to upgrade the software on them and move them from one service provider to another. They were housed in one of those secure server rooms that's kept at a cool 60 degrees. I was led into the room and started the upgrade in the middle of the night. Unfortunately, during the upgrade something went wrong and the routers just wouldn't come back up! Now pay-per-view was out for every customer! After an hour of sweating bullets, and a few frantic phone calls waking people up in the middle of the night, I got things going.

On my drive home I promised myself that I would go back to school and also strive to get into programming professionally. I never wanted to deal with that situation again. None of the people I called could help me and I realized that I didn't have any real understanding of how these systems worked. To this day I have no idea what it was that fixed the problem (despite the fact that it was something that I did).

What was your first language?
BASIC on the Commodore 64C was my first. C was the next thing I tried to learn, but real understanding didn't come until much later. Delphi was the first language that I ended up writing as a full time programmer.

What was the first real program you wrote?
The first thing that I felt was real was a web application for performing security assessments and compliance audits. The product actually never saw any decent customer usage (which is why the startup failed) so I'm not sure you can call it a "real" program. However, we had multiple releases and a few customers.

I've written a bunch since then, but I'm still striving to produce something that I'm impressed by. That's really what my work on Tahiti is all about.

What was your first professional programming gig?
That was in 2001 for the previously mentioned startup. At first it was a regular client application written in Delphi. We then worked on a web application written in VB.NET. The application itself was pretty mundane, but I was excited about the job because I was working in a startup with friends and it was my first job as a programmer. We ended up working a bunch of 80 hour work weeks and I learned a ton. I learned about programming, startups, how selling software in 2001-2003 is nothing like selling software in 1999, and how you're never more alive at work than when you're working on something new with a small team.

If there is one thing you learned along the way that you would tell new developers, what would it be?
You aren't and never will be the best programmer in the world, but it doesn't mean you shouldn't try. Corollary: Let Ryan Davis and Zed Shaw fight for the title of the biggest badass programmer, don't be too intimidated and keep expanding your skill set.

What's the most fun you've ever had programming?
Any time I sit down to work on Tahiti. Working on your own project and vision is by far the most fun you can have.


Who's up next?
Trotter Cashion
Zed Shaw
Josh Knowles
Ryan Davis
Geoffrey Grosenbach
Bryan Helmkamp
Obie Fernandez
Josh Susser

June 19, 2008

Trouble With Integrated Background Rails Libraries

Originally this post was going to be about how I was using Adam Pisoni's ruby gem named Skynet to do distributed background processing. However, this is not a story of success with that library.

The concept of Skynet is really cool: a dead simple MapReduce implementation with ActiveRecord integration. Installing it was a breeze and getting it to work on my development system was quick and painless. Getting it running in production was an entirely different story. First I ran into problems with the logging. Skynet logs information to its own log files so I had to make sure the permissions on those were set up correctly. Then when I tried to get it running on another EC2 instance it just wouldn't work. No errors, no information, no nothing. So I decided to jump into the code and see if I could figure things out. A few hours later, after looking through quite a bit of code, I was no closer to knowing why it was failing.

I then realized something. I'm not trying to do something horribly complex. I don't need the MapReduce paradigm. I just needed a queue and a way to process jobs from it. What's more, I didn't even need a disk based persistent queue. I think this is a failing of a bunch of these fully integrated background processing libraries. They're tied to DRb, or a specific type of queue, or the workers aren't customizable, or the workers have undesirable behavior (like BJ spawning a full Rails instance for every job execution). The problem is that I wanted a few basics like processing a method on an AR object asynchronously (something all of these do to varying degrees of success), but I also wanted some very custom behavior for how workers handle web crawling jobs, which is a task that I have to do a bit of.

I ended up going with Beanstalkd and the beanstalk-client gem (for a good intro to beanstalk read Geoffrey's post). I decided not to use the Rails integrated async-observer because the worker it provides didn't work for my needs. Instead, I wrote a basic wrapper for the queue, a generic job wrapper, and a worker. The code for the worker is something I really like because it's crazy simple. Have a look:

require "#{File.dirname(__FILE__)}/../config/environment.rb"

class QueueWorker
  def initialize(logger, queue)
    @queue = queue
    @logger = logger
  end
 
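  # Pass false to process a single job and return (handy in tests).
  def start(continue_loop = true)
    begin
      begin
        job = @queue.reserve
        job.process
        @queue.mark_as_finished(job)
      end while continue_loop
    # Any failure ends the loop; it gets logged and the worker stops.
    rescue Exception => e
      @logger.error "Queue worker encountered an error: #{e.message}"
    end
  end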
end

QueueWorker.new(RAILS_DEFAULT_LOGGER, QueueWrapper.new(ARGV.first)).start

I like this worker because it doesn't know anything about the job or the queue. As long as the queue responds to 'reserve' and 'mark_as_finished', it can be anything. Likewise, the thing that is returned from reserve only needs to respond to 'process'. The logger object only needs to respond to 'error'. The worker is really stupid. It just grabs things off the queue and processes them and logs an error if execution stops. I could easily substitute the RAILS_DEFAULT_LOGGER in initialization with a wrapper to a monitoring service that will alert me on failure.
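
To show how little the worker demands of its collaborators, here's a sketch that runs it against plain in-memory objects. The worker is repeated without the Rails environment line so the example stands alone. ArrayQueue, PrintJob, and MemoryLogger are hypothetical stand-ins I made up for the illustration, not part of the real setup.

```ruby
# Same worker as above, minus the Rails environment require.
class QueueWorker
  def initialize(logger, queue)
    @queue = queue
    @logger = logger
  end

  def start(continue_loop = true)
    begin
      begin
        job = @queue.reserve
        job.process
        @queue.mark_as_finished(job)
      end while continue_loop
    rescue Exception => e
      @logger.error "Queue worker encountered an error: #{e.message}"
    end
  end
end

# Hypothetical queue backed by an array; raises once it runs dry.
class ArrayQueue
  attr_reader :finished

  def initialize(jobs)
    @jobs = jobs.dup
    @finished = []
  end

  def reserve
    @jobs.shift || raise("queue is empty")
  end

  def mark_as_finished(job)
    @finished << job
  end
end

# Hypothetical job: anything with a process method will do.
class PrintJob
  def initialize(text)
    @text = text
  end

  def process
    puts @text
  end
end

# Hypothetical logger that just remembers error messages.
class MemoryLogger
  attr_reader :errors

  def initialize
    @errors = []
  end

  def error(message)
    @errors << message
  end
end

DEMO_LOGGER = MemoryLogger.new
DEMO_QUEUE = ArrayQueue.new([PrintJob.new("one"), PrintJob.new("two")])
QueueWorker.new(DEMO_LOGGER, DEMO_QUEUE).start
```

The worker processes both jobs, then the empty queue raises, the error gets logged, and the loop ends. A real beanstalk-backed queue would block on reserve instead of raising.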

The QueueWrapper and another class called JobWrapper are the only points of interface between generic regular ruby code and the queueing mechanism. It's their responsibility to make sure that reserve and mark_as_finished methods exist. The JobWrapper also comes into play when pushing jobs onto the queue and marshaling them from the queue.
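
For a feel of what that wrapper layer might look like, here's a guess at its shape, not the actual classes. Unlike the real QueueWrapper, which is built from a host:port string, this sketch takes a connection object so it can run without a queue server. I'm assuming the connection responds to put and reserve and that reserved messages respond to body and delete. The fake connection and DoubleJob are made up for the demo.

```ruby
# Carries a job plus the raw queue message it was loaded from.
class JobWrapper
  attr_reader :raw_message

  def initialize(job, raw_message = nil)
    @job = job
    @raw_message = raw_message
  end

  # Serialize any job object to a string for the queue.
  def self.dump(job)
    Marshal.dump(job)
  end

  # Rebuild the job from a reserved message.
  def self.load(raw_message)
    new(Marshal.load(raw_message.body), raw_message)
  end

  def process
    @job.process
  end
end

# The only class that talks to the queue connection.
class QueueWrapper
  def initialize(connection)
    @connection = connection
  end

  def push(job)
    @connection.put(JobWrapper.dump(job))
  end

  def reserve
    JobWrapper.load(@connection.reserve)
  end

  def mark_as_finished(wrapped_job)
    wrapped_job.raw_message.delete
  end
end

# Hypothetical in-memory stand-ins for a queue connection and a job.
FakeMessage = Struct.new(:body, :deleted) do
  def delete
    self.deleted = true
  end
end

class FakeConnection
  attr_reader :messages

  def initialize
    @messages = []
  end

  def put(body)
    @messages << FakeMessage.new(body, false)
  end

  def reserve
    @messages.find { |message| !message.deleted }
  end
end

class DoubleJob
  def initialize(value)
    @value = value
  end

  def process
    @value * 2
  end
end

WRAP_CONN = FakeConnection.new
WRAP_QUEUE = QueueWrapper.new(WRAP_CONN)
WRAP_QUEUE.push(DoubleJob.new(21))
WRAPPED = WRAP_QUEUE.reserve
WRAP_RESULT = WRAPPED.process
WRAP_QUEUE.mark_as_finished(WRAPPED)
```

Marshal only works if the job's class is loaded on both the pushing side and the worker side, which is one reason the worker script loads the full environment.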

The real magic happens in the job definitions. With this structure I can define any kind of job I want. Here's the important snippet of some code for a generic AR async method call:

  def process
    model_name.constantize.find_by_id(id).send(method_to_send)
  end
 
  def self.queue_job(queue, model_instance, method)
    queue.push(GenericJob.new(model_instance.class.to_s, model_instance.id, method))
  end

And the active record integration looks like this:

class ActiveRecord::Base
  def send_later(method)
    GenericJob.queue_job(ActiveRecord::Base.app_background_queue, self, method)
  end
 
  def self.app_background_queue
    @@app_background_queue ||= QueueWrapper.new("#{BEANSTALK_HOST}:#{BEANSTALK_PORT}")
  end
end

So I built async processing for AR objects in less than a few hundred lines of code. The cool thing with this approach is that I can now define new jobs for crawling that have different process behavior. I actually queue up full batches of things instead of a single AR object. I can also define specific processing behavior like using multiple threads to deal with slow network IO. I like it because the worker that runs this job doesn't ever have to know about all this. It's completely agnostic to the types of jobs that it processes.
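
Here's a sketch of what a batch crawling job with threads could look like. It's an illustration of the idea, not my production job. BatchCrawlJob, its arguments, and the fake fetcher are hypothetical. The fetcher is injected only so the example runs without touching the network.

```ruby
require 'thread'

# Hypothetical batch job: fetches many URLs with a small pool of threads.
class BatchCrawlJob
  attr_reader :results

  # fetcher is anything that responds to call(url) and returns the body.
  def initialize(urls, fetcher, thread_count = 4)
    @urls = urls
    @fetcher = fetcher
    @thread_count = thread_count
    @results = {}
  end

  def process
    pending = Queue.new
    @urls.each { |url| pending << url }
    lock = Mutex.new

    threads = Array.new(@thread_count) do
      Thread.new do
        loop do
          url = begin
            pending.pop(true) # non-blocking pop raises ThreadError when empty
          rescue ThreadError
            break
          end
          body = @fetcher.call(url)
          # Guard the shared hash while recording the result.
          lock.synchronize { @results[url] = body }
        end
      end
    end

    threads.each(&:join)
    @results
  end
end

FAKE_FETCH = lambda { |url| "<rss>#{url}</rss>" }
CRAWL_JOB = BatchCrawlJob.new(["http://a.example/feed", "http://b.example/feed"], FAKE_FETCH)
CRAWL_JOB.process
```

One caveat: a lambda can't be marshaled, so a job that actually goes onto a queue would need to name its fetcher some other way, such as a class constant.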

Wow, this ended up going on for quite a while. I guess the lesson learned for me is that sometimes it's quicker to just pick up a few basic building blocks and wire them together.

Note: In fairness to Adam, he's busy working on a new version of Skynet that will offer other queuing options. It's definitely a project worth keeping an eye on.

June 12, 2008

Mocking Web Requests in Ruby with FakeWeb

I've been using rFeedParser lately to do a little feed parsing. One of the problems I ran into was trying to test the thing. Since there's network IO involved, I obviously want to get some mock action. In my first attempt I ended up having to modify rFeedParser's call to open to explicitly call OpenURI.open_uri so that I could mock it out with rspec like this:

OpenURI.should_receive(:open_uri).at_least(:once).with(feed_url, {
  "Accept-encoding"=>"gzip, deflate",
  "User-Agent"=>"Filterly +http://www.filterly.com",
  "Accept"=>"application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1",
  "A-IM"=>"feed"
  }).and_return(*xml_mocks)

xml_mocks is an array of strings that hold the xml I want to return on my sequence of calls. As you can see it's ugly as hell and not that fun to use. Last night at the New York City Ruby meeting (known as nyc.rb) I got a little pointer from Bryan Helmkamp to a library called FakeWeb. I have no clue how this escaped my notice until now, but it makes things much cleaner and easier. FakeWeb is a library written by Blaine Cook for faking web requests. Grab it like so:

sudo gem install FakeWeb

Then have fun with it like so:

require 'fake_web'
require 'open-uri'
# from the contents of a file:
FakeWeb.register_uri("http://www.pauldix.net/", :file => "some_file.xml")
# or from the contents of a string:
FakeWeb.register_uri("http://www.pauldix.net/", :string => "foo")
# and boom:
open("http://www.pauldix.net").read # =>  "foo"

There's only one gotcha to look out for. When you call FakeWeb.register_uri, make sure that the uri you pass in has a slash at the end. Your call to open can include the slash or not, but if you don't register the uri with the slash, the faking won't actually happen.
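
One way to sidestep the gotcha is to normalize the uri before registering it. This is a small sketch using only Ruby's standard URI library. The helper name with_root_slash is made up, and I'm assuming the problem only shows up when the uri is a bare host with an empty path.

```ruby
require 'uri'

# Returns the url with "/" as its path when the path is empty, so
# "http://www.pauldix.net" becomes "http://www.pauldix.net/".
# Urls that already have a path are returned unchanged.
def with_root_slash(url)
  uri = URI.parse(url)
  uri.path = "/" if uri.path.empty?
  uri.to_s
end

BARE = with_root_slash("http://www.pauldix.net")
SLASHED = with_root_slash("http://www.pauldix.net/")
WITH_PATH = with_root_slash("http://www.pauldix.net/feed.xml")
```

Then the registration would read FakeWeb.register_uri(with_root_slash(url), :string => "foo") and it no longer matters how the url was typed.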