Last night I was working on a part of Tahiti that requires some multi-process love. It's one of the pieces that gathers data. Basically just a feed consumer. The trick is that I want to be able to eat up a bunch of feeds in a short amount of time which is where multi-process comes in. I could just write some multi-threaded stuff, but I want to be able to scale to multiple processors and multiple machines.
I've never tried this stuff in Ruby so I started looking for options. Here's the quick rundown of buzzword names that I found in my search.
- DRb - Distributed Ruby (Like RMI or CORBA)
- Rinda - Ruby implementation of Linda (part of DRb for services and clients to auto-discover each other)
- BackgrounDRb - job server and scheduler for running long running tasks (primarily for use with Rails)
- AP4R - asynchronous processing for Ruby
- Starfish - map reduce for Ruby
DRb and Rinda look to be the basic building blocks for the latter three services. BackgrounDRb looked interesting, although I'm not sure it's what I need. I'm not kicking this thing off from a Rails controller action, and I don't need to feed info back into the front end. This is purely a background task that needs to run all the time. AP4R also looks interesting, but it seems most of the documentation is in Japanese.
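To give a feel for what DRb provides on its own, here is a minimal sketch of a shared list of feed URLs served over DRb, with a worker that pulls from it. The `FeedQueue` class, the `drain_feeds` helper, and the example URLs are all made up for illustration; only `DRb.start_service` and `DRbObject.new_with_uri` come from the standard library.

```ruby
require 'drb/drb'

# Hypothetical work list: hands out one feed URL at a time.
# The mutex matters because DRb serves each remote call on its own thread.
class FeedQueue
  def initialize(urls)
    @urls = urls.dup
    @lock = Mutex.new
  end

  # Returns the next URL, or nil once the list is empty.
  def next_feed
    @lock.synchronize { @urls.shift }
  end
end

# Hypothetical worker loop: connects to the queue at +uri+ and yields
# each URL to the caller's block until the queue is drained.
def drain_feeds(uri)
  queue = DRbObject.new_with_uri(uri)
  results = []
  while (url = queue.next_feed)
    results << yield(url)
  end
  results
end
```

The process that owns the list would call something like `DRb.start_service('druby://localhost:9000', FeedQueue.new(urls))` followed by `DRb.thread.join` (the port is arbitrary here), and each worker process would call `drain_feeds` with the same URI and a block that fetches and parses the feed.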
After a quick look at Starfish I decided that it was what I would attempt to use to accomplish this task. It's advertised as a simple Ruby version of map reduce by its creator Lucas Carlson. I like how simple the interface is, and it would be perfect for my needs if it weren't for one huge problem. The memory usage of the processes it kicks off is huge (something like 30 MB each)! I assume this is because of ActiveRecord. Unfortunately this eliminates Starfish/ActiveRecord as an option, since the VPS I'm setting the prototype up on has only 256 MB of RAM.
I have a few other observations on Starfish. It's really only the map part of map-reduce (which is still quite useful). Also, I don't quite understand how to start up multiple clients with it. The documentation says to just run:
starfish my_awesome_map_reduce.rb
I guess that starts the whole thing and then running that again will kick off additional clients. I need to dig into the starfish code to more closely understand things, but that didn't seem like a decent way to kick off multiple clients. The issue I had is that if you're kicking them off using a rake task, how can you be sure to start up the additional clients after the server has successfully started? Here's the super lame thing I did:
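namespace :crawl do
  desc "Kicks off Starfish 3 times for feed_updater.rb"
  task :feed_updater => :environment do
    # First invocation: presumably this one starts the server.
    method(:fork).call { system("starfish lib/feed_updater.rb") }
    # Crude wait to give the server time to come up.
    sleep(30)
    # Two more invocations, which should join as additional clients.
    2.times do
      method(:fork).call { system("starfish lib/feed_updater.rb") }
    end
  end
end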
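One less fragile alternative to the fixed `sleep(30)` would be to poll until the server actually accepts connections. The sketch below assumes the server listens on a TCP port that you know in advance; I have not checked which port (if any) Starfish uses, so the host and port are placeholders and `wait_for_port` is a made-up helper name.

```ruby
require 'socket'

# Hypothetical helper: polls host:port until a TCP connection succeeds.
# Returns true once the port accepts a connection, false if +timeout+
# seconds pass without one.
def wait_for_port(host, port, timeout: 30, interval: 0.5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    begin
      TCPSocket.new(host, port).close
      return true
    rescue SystemCallError
      # Connection refused or similar: the server is not up yet.
      return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      sleep(interval)
    end
  end
end
```

In the rake task, the `sleep(30)` line would become something like `wait_for_port('localhost', server_port) or abort('server never came up')`, where `server_port` is whatever the server turns out to listen on.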
Of course that was just to test things and now that I know these processes are whores for memory, the whole strategy is out anyway. I am curious how other people using Starfish kick off the additional clients.
So now I have to look for something else to accomplish this task. I'll probably end up using DRb and Rinda, but write my own custom code for the other stuff and skip using ActiveRecord. So much for easy distributed computing abstraction.
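As a rough idea of what the Rinda side of that custom code might look like, here is a sketch that uses a tuple space as the work queue, with no ActiveRecord involved. The `:feed` tag and the `enqueue_feeds` and `take_feed` helpers are my own inventions for illustration. The tuple space here lives in a single process; sharing it between machines would mean exposing it over DRb, which this sketch leaves out.

```ruby
require 'rinda/tuplespace'

# Hypothetical helper: writes one [:feed, url] tuple per feed URL.
def enqueue_feeds(space, urls)
  urls.each { |url| space.write([:feed, url]) }
end

# Hypothetical helper: removes one feed tuple and returns its URL,
# or nil if no feed tuple is available right now.
def take_feed(space)
  # nil in the template is a wildcard; a timeout of 0 means "don't block".
  _tag, url = space.take([:feed, nil], 0)
  url
rescue Rinda::RequestExpiredError
  nil
end
```

Each worker would loop on `take_feed` and process whatever URL it gets back. Because `take` removes the tuple from the space, two workers should never end up with the same feed.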
As a final aside, Lucas mentions how distributing the task of processing a log file slowed things down by 20x because of the overhead of sending the data over the wire. I assume MapReduce would have this problem too if it weren't for the Google File System (GFS). One of the things the MapReduce paper mentions is the importance of taking advantage of data locality, meaning that work on a given chunk of the input is scheduled on a machine that already holds a copy of that chunk, or failing that, on one in the same rack as a machine that does. The GFS paper also mentions data redundancy across the cluster, which I would think helps MapReduce run faster, since more copies mean more machines that can process a chunk locally.
Now it's back to the drawing board. If you have any suggestions or pointers to good reading materials about distributed computing in Ruby I would greatly appreciate it.