
July 20, 2007

Server Setup is The Suck

I'm trying to get my application set up and deployed on EC2. It's not going well. I'm using Paul Dowman's AMI, but of course it's giving me problems. Fix one problem, then on to the next, then the next, then the next. Maybe later I'll restart the whole thing and take notes on the failures. Now if you'll excuse me, I have to find a small animal to kick.

Update: I finally got it up. After more sweat and toil I found that it was probably just something really stupid on my part. Tomorrow night I'll have to make another image for the workers. Thanks for making the image Paul, you rock!


July 18, 2007

A Distributed Parallel Ruby Feed Getter

Work continues on my attempt to make a decent Ruby abstraction for writing simple distributed, multiprocess, and multithreaded batch processing applications kind of like MapReduce (I outlined much of the effort in my posts about project Tahiti). I thought it might be interesting to look at what some code that uses the system looks like. Other than a little test code, the first thing I've written that will use the system is just a feed updater. It brings back a list of feeds from an AR backed model, downloads them, and updates them. Here's the code with a few comments to help point out the use of each piece (although if you haven't read my previous posts you may be lost anyway):

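require 'rubygems'
require 'open-uri'    # lets open() fetch a URL and exposes last_modified
require 'feed_tools'

class FeedUpdater < MapReduce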
  # had to do this so that feeds that hang can be killed.
  THREAD_TIMEOUT = 20
  # just a constant I use in multiple places.
  # it determines how many key/value pairs a chunk process will take care of
  CHUNK_SIZE = 200

  # The map calls are threaded in the chunk_processor.
  # No AR calls can be made in here. Just for the slower stuff
  def map(blog_id, blog_data)
    begin
      last_modified = open(blog_data[0], {"User-Agent" => "Tahiti +http://www.pauldix.net",
        "From" => "..."}) {|f| f.last_modified}
      if (!blog_data[1] or last_modified > blog_data[1])
        feed = FeedTools::Feed.open(blog_data[0], {:user_agent => 'Tahiti +http://www.pauldix.net'})
        output.emit(blog_id, [feed, last_modified])
      end
    rescue Exception => exception
      output.log("Exception: ", exception.message)
      output.emit(blog_id, :error)
    end
  end
 
  # What a DRb Worker machine sends to the chunk processor Process.
  # notice that it's not sending AR objects
  def self.get_key_value_pairs(chunk_number)
    Blog.find_chunk_to_update((chunk_number-1)*CHUNK_SIZE, CHUNK_SIZE).collect do |blog|
      [blog.id, [blog.feed_url, blog.last_updated_on]]
    end
  end
 
  # what the Master DRb calls before sending the different chunk numbers out to DRb Worker machines
  # This is for two things. One is to move data around to where it may be needed. Say out of the
  # production database and into some other place. Secondly, the count is for later so that an
  # estimate can be made of how many EC2 nodes to start up.
  def self.setup_chunks_and_get_total
    count = Blog.count_by_sql("SELECT count(*) AS count_all FROM blogs WHERE (update_failures < 5)")
    # round up so a final, partially filled chunk is not dropped
    (count + CHUNK_SIZE - 1) / CHUNK_SIZE
  end

  # this is the part where we can use active record.
  # This is in the master thread of the chunk processor
  def self.write_key_value(k, v)
    blog = Blog.find(k)
    if v == :error
      blog.update_attribute(:update_failures, blog.update_failures + 1)
    else
      blog.update_articles_from_feed(v[0], v[1])
    end
  end
end

Basically it's just the map phase of a MapReduce. There are some additional parts that help the system distribute the work. The thing I like is that nowhere in this piece of code do I really have to worry about the details of threading, multiprocess, and multi-machine headaches. I need to test the ActiveRecord part a little more to be sure that it works. I'm trying to get around the fact that it isn't multithreaded by having only the main thread in the chunk process make AR calls.

As a stylistic side note, I think some bits of this are a little ugly and could probably be made a little more readable with a few Structs.
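
A rough sketch of what that could look like. The Struct names FeedInput and FeedResult and the helper needs_update? are hypothetical and do not appear in the code above; they only show how named fields could replace blog_data[0] and blog_data[1].

```ruby
# Hypothetical value objects; names and fields are illustrative only.
FeedInput  = Struct.new(:feed_url, :last_updated_on)
FeedResult = Struct.new(:feed, :last_modified)

# Same check map performs, but reading named fields instead of array indexes.
def needs_update?(input, last_modified)
  input.last_updated_on.nil? || last_modified > input.last_updated_on
end

input = FeedInput.new("http://example.com/feed.xml", nil)
puts needs_update?(input, Time.now)
```

Structs assigned to constants can be marshalled, so they should still be able to cross a DRb connection as long as both sides load the same definitions.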

The MapReduce parent class provides a number of default options like how large chunks should be, how often the chunk processors should update the Master DRb on the progress, the estimated amount of memory each chunk process will use, and the number of threads that a chunk process should kick off to process the different key/value pairs.

Now I just have to get this thing deployed onto EC2 and see how it does. I'm sure there will be plenty of fixes and updates to the whole thing before it becomes stable enough to run a bunch of different tasks. I hope that it will generalize to other tasks because distributed computing abstractions seem to be leaky no matter what (leaky abstractions).

Thanks to Zed and Defiler for helping me think through how to get the multi-threaded stuff to work. Hopefully I'll be making a post about the multi-threading stuff in the ChunkProcessor pretty soon.


July 12, 2007

Notes on Ruby's DRb

Yesterday I described the basic design of a distributed Ruby system I'm implementing. One of the first parts of building this system is figuring out how the different machines are going to communicate and cooperate. Distributed Ruby (DRb) seems like the perfect choice for the task. Programming Ruby describes it as a simple, lightweight Ruby version of RMI or CORBA.

Unfortunately, the Pickaxe book is my only reference and it includes only the most basic example. I guess it will help me to lay out how I think things work, a few of the concerns I have, and my guesses at their answers.

First thing to take note of is that DRb objects are multithreaded. So if I start some object of mine running on port 1337, multiple remote processes can connect to it. Not only that, but they can make calls at the same time. To see some of the multithreaded action I created a few simple DRb scripts like so:

require 'drb'
require 'drb/observer'

class Worker
  def initialize
    @test = 0
  end

  # sleeps, then bumps and returns the shared counter
  def run(sleep_val)
    sleep(sleep_val)
    @test += 1
    return @test
  end
end

worker = Worker.new
DRb.start_service("druby://127.0.0.1:1337", worker)
DRb.thread.join

And a test script to call it:

require 'drb'

sleep_time = ARGV[0].to_i
worker = DRbObject.new(nil, "druby://127.0.0.1:1337")
val = worker.run(sleep_time)
puts "test val: #{val}"

I bring up three console windows. In one I run the worker, which now sits waiting to be called on port 1337. In another I kick off the test caller with an argument to make it sleep for 10 seconds. In the last I kick off the test caller with an argument to make it sleep for only a second.

Running that test I can see three things that I think are important. First is that both calls succeeded so the multithreaded nature of the DRb object works exactly as I would expect allowing multiple connections at the same time. Second is that the @test instance variable in the worker object is shared. There is one worker object that handles both requests and threads them off. This tells me that if I'm going to be doing anything with instance variables in DRb objects, I have to make them threadsafe using a Queue or some locking mechanism like Mutex or Monitor. The last thing is that the calls from the test script don't return until the remote worker finishes. So the .run calls don't just send the data out to the remote worker and go about their business, they wait for a response. I also tested for when the connection is made with a slight modification of the test script. The call to DRbObject.new doesn't actually connect to the remote worker. That doesn't occur until the .run method call.
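
A minimal sketch of the locking idea, assuming a Mutex is the mechanism chosen. SafeWorker is a hypothetical variant of the Worker above, and plain threads stand in for concurrent DRb callers so the example runs without a server.

```ruby
# Hypothetical variant of Worker with a Mutex around the shared counter.
class SafeWorker
  def initialize
    @test = 0
    @lock = Mutex.new
  end

  def run(sleep_val)
    sleep(sleep_val)
    # Only one thread at a time may read-modify-write @test.
    @lock.synchronize { @test += 1 }
  end
end

worker = SafeWorker.new
# Ten threads play the role of ten simultaneous remote calls.
threads = 10.times.map { Thread.new { worker.run(0.01) } }
values = threads.map(&:value)
puts values.sort.inspect
```

Because the increment and the return value both happen inside synchronize, every caller gets a distinct counter value.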

All of this is incredibly boring and simple, but I need to take small steps to think about the problem. So here's a slight rehash of my structure for the different pieces. One Master process runs for each MapReduce that gets kicked off. One DRbWorker runs on each machine that is available to perform processing. The Master sends as many input Chunks to the DRbWorker as it can handle. For each Chunk processing request, the DRbWorker starts a new process that it detaches from. It hands that script a druby uri so the Chunk process can report back to the Master what its status is. This means the call the Master makes to the DRbWorker to process the Chunk returns very quickly so the connection doesn't stay open (I'm assuming the connection is closed after the call returns, but I have to test this somehow).

This also means that the Master will have to start listening for incoming calls from the various Chunk processes before kicking them off to the DRbWorker. I was looking at using 'drb/observer' to have the DRbWorker send update notifications back to the Master, but I think it's easier to just have the Chunk process handle that. The DRbWorker is really just a dispatcher. It dispatches new processes into the machine that are requested from the Master.
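
A small sketch of the "Master listens first" ordering. MasterStatus, report, and status_for are hypothetical names, not part of the system described here. To keep it runnable in one file, the caller lives in the same process; in the design above it would be a separate Chunk process that was handed the druby uri.

```ruby
require 'drb'

# Hypothetical front object for the Master: Chunk processes call report on it.
class MasterStatus
  def initialize
    @lock = Mutex.new
    @status = {}
  end

  # DRb may run several of these calls at once, so guard the hash.
  def report(chunk_number, state)
    @lock.synchronize { @status[chunk_number] = state }
    true
  end

  def status_for(chunk_number)
    @lock.synchronize { @status[chunk_number] }
  end
end

master = MasterStatus.new
# Start listening before any chunk is dispatched.
# Port 0 asks the OS for a free port; server.uri holds the real address.
server = DRb.start_service("druby://127.0.0.1:0", master)

# This uri is what would be handed to each Chunk process.
remote_master = DRbObject.new_with_uri(server.uri)
remote_master.report(1, :done)
puts master.status_for(1)

server.stop_service
```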

I guess the only concern I have going forward is the issue of tracking which PIDs are responsible for processing each Chunk. I want the Master to be able to make a call to the DRbWorker to kill and/or restart a Chunk process if it thinks it has hung. I guess it's time to dig into launching and monitoring child processes.
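
One possible shape for that bookkeeping, as a sketch only. ChunkDispatcher and its methods are hypothetical; the calls it relies on (Process.spawn, Process.detach, Process.kill) are standard Ruby. A sleeping Ruby child stands in for a real chunk script.

```ruby
require 'rbconfig'

# Hypothetical dispatcher that remembers which PID handles which chunk.
class ChunkDispatcher
  def initialize
    @pids = {}
  end

  # Start a child process for a chunk and return right away.
  def start(chunk_number, *command)
    pid = Process.spawn(*command)
    # detach reaps the child in a background thread, so no zombie is left behind
    @pids[chunk_number] = { pid: pid, waiter: Process.detach(pid) }
    pid
  end

  def running?(chunk_number)
    entry = @pids[chunk_number]
    !entry.nil? && entry[:waiter].alive?
  end

  # Ask a (possibly hung) chunk process to terminate, then forget it.
  def kill(chunk_number)
    entry = @pids.delete(chunk_number)
    return false if entry.nil?
    begin
      Process.kill("TERM", entry[:pid])
    rescue Errno::ESRCH
      # the process had already exited
    end
    entry[:waiter].join
    true
  end
end

dispatcher = ChunkDispatcher.new
dispatcher.start(1, RbConfig.ruby, "-e", "sleep 30")
puts dispatcher.running?(1)
dispatcher.kill(1)
puts dispatcher.running?(1)
```

Restarting a chunk would then be a kill followed by another start with the same chunk number.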


July 11, 2007

Distributed, MultiProcess, MultiThreaded Ruby Hurts My Brain

I've been looking into spreading out some batch processing and background jobs over a group of machines. The trick is that I don't want to deal with the dirty details of writing multiprocess and multithreaded stuff more than once. I want to write an abstraction like MapReduce so I can then easily write code that runs distributed, but looks deceptively simple. The other catch is that I'm designing it to run on EC2.

I started out last week looking at Starfish as an option, but it didn't quite meet my needs. It uses Rinda, which relies on UDP broadcasts. I'm not even sure that will work in EC2 and in general it just seems like too much of a chatty Kathy. So I decided to look at DRb and figure out how to write my own stuff. Here's the outline of how I want it to work from a high level.

  • I write a basic MapReduce task (my design isn't exactly MR yet)
  • Run some cap deploy from my dev box that kicks off the following
  • Tells the main web server (running a Master) to start the number of EC2 nodes it determines it needs
  • Each EC2 node comes up and syncs the latest code with a "Worker" process waiting to be called
  • The Master sends the information on different input chunks out to each EC2 node through the Worker
  • The Worker kicks off a separate process for each chunk (each machine should be able to process multiple chunks at once)
  • The Chunk process gets the input chunk (in this case from the database) and splits the chunk into key/value pairs and threads out calls to map with each one
  • The threaded map calls return and the Chunk process notifies the Master (repeat for all chunks)
  • The Master turns down each EC2 node as it finishes processing all Chunks
This way I can run a bunch of different tasks all at the same time and only pay for the time that I use.

First is the MapReduce. When I write one it must have the following:
  • A class method to determine how many chunks there are (needs AR access)
  • A class method to get each chunk and split into its key/value pairs (needs AR access and the outputted key/value pairs shouldn't be AR objects)
  • A map method
  • An attr_accessor for an output object that has an emit(key, value) method
  • A class method that takes a key/value pair from an emit and can use them to write data (needs AR access)
Notice how I flagged each thing in there that needs ActiveRecord access. That's because the Chunk process kicks off the map processing of each key/value pair in its own thread and ActiveRecord isn't threadsafe. I noticed a flag for ActiveRecord::Base.allow_concurrency, but that causes AR to open up a new connection for each thread and the whole thing just kind of scares me. If my fears are without merit please speak up. As it stands I have the Chunk process as the single AR connection and the gateway to that kind of interaction. That's where emit comes in. It tosses the values into a Queue which is threadsafe. As each thread terminates the main thread of the Chunk process clears out the Queue and calls to the MapReduce class method that knows what to do with the data.
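
The emit-into-a-Queue pattern can be sketched roughly as follows. Output and process_chunk are hypothetical names, and a lambda stands in for the class method that would talk to ActiveRecord. For simplicity this version waits for all threads before draining the Queue, rather than draining as each thread finishes.

```ruby
# Hypothetical output object handed to map; emit only pushes onto a Queue.
class Output
  def initialize(queue)
    @queue = queue
  end

  def emit(key, value)
    @queue << [key, value]
  end
end

# Runs the block (standing in for map) once per pair, each in its own thread,
# then passes every emitted pair to the writer on the calling thread only.
def process_chunk(pairs, writer)
  queue = Queue.new
  output = Output.new(queue)
  threads = pairs.map do |key, value|
    Thread.new { yield(key, value, output) }
  end
  threads.each(&:join)
  # Only this thread touches the writer, the way only the main thread touches AR.
  writer.call(*queue.pop) until queue.empty?
end

written = {}
process_chunk([[1, "a"], [2, "b"]], ->(k, v) { written[k] = v }) do |k, v, out|
  out.emit(k, v.upcase)
end
puts written.inspect
```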

The Worker is just a DRb process that listens for calls from the Master and kicks off new Chunk processes when instructed. The Master is the process that orchestrates the whole thing and will eventually be responsible for making sure each chunk gets processed. It is also a DRb listener since it has to receive updates from the Worker on each chunk.

I still have to work out some of the details since I've been testing out different things over the last week. Once I have the whole mess running on my development box I'll start working on getting it to work on EC2 (the real fun part).

Someone may tell me I should be using BackgroundDRb, but I'm not sure it will fit my needs either. I'm not looking at running just a few long running background tasks. I'm looking at running a bunch of stuff in parallel and with changing code and completely disconnected from the web server. Although at this point I'm so delirious that this may just be a really stupid design and overly complicated system.


July 07, 2007

Keeping Tahiti Secret for TechCrunch20

Tahiti is the name of the project I've been working on for a while, but only recently gave a codename. A week ago I decided to start blogging about some of the different pieces without really giving away what the product is. One reason for the secrecy is because I don't really want to hype up something before its release. The blogging is about the technology and discussion with the Ruby community, not the product. The second reason is that I was considering submitting for the TechCrunch20 conference. Just in case you haven't heard of it here's a blurb from their site:

Twenty of the hottest new startups from around the world will announce and demo their products over a two day period at TechCrunch20. And they don’t pay a cent to do this. They will be selected to participate based on merit alone. In fact, we’re even offering a $50,000 cash award and lining up other in-kind services and awards from a generous group of corporate sponsors.
The deadline for submission was yesterday and I sent one in. I was hesitant to submit since I wanted to keep this project smaller scale in the beginning, but the opportunity was too much to resist. Because of the larger scale, I've recruited a long time friend and colleague to help out with the project. I referred to him as "The Adept" for a project I wanted to do a few years ago, but never got off the ground.

We won't find out if we've been selected as finalists until August 3rd, but this will definitely help to keep us motivated and on track. In the meantime I'll keep blogging about the coding and infrastructure process without giving away the idea.


July 05, 2007

Thinking of using EC2

Over the last few days I've come to the realization that a single 256 MB VPS isn't going to cut it even for the prototype of my application. The problem is that I will have a number of background running tasks that can be CPU, memory, and network intensive and I just won't want those running on my web server. The issue is that I'm on less than a shoestring budget, so I can't really pay to have a few servers turned up all the time.

This is exactly why I'm thinking about EC2. The stats look pretty good. At least 1.76 GB of memory is much better than the 256 MB I have now. The concern I have is running my app and database only on EC2. I've noticed a bunch of chatter about this, but I think it's ok for my needs. I won't be storing mission critical data so if I just run a backup to S3 every few hours I should be fine. From what I've read on different blogs and forums, the service is pretty reliable so I shouldn't even have to worry about it (but of course I will).

Now for the costs. The average to keep one server turned up full time is about $73 a month before bandwidth. The bandwidth is super cheap so I'm not really worried about that. I'm currently paying $20 for my VPS so I could actually be running about 4 of the same VPS systems for around the same price. However, with EC2 I get a lot more memory which is important based on my tests over the last few days. The other advantage is that I can turn up more instances as needed and just shut them down when I'm done. This is very attractive to me right now since I know that I won't be running the secondary servers all the time and I can just pay for the hours I need them.
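
The arithmetic behind those numbers, as a sketch. The $0.10 per instance-hour rate is an assumption, picked because it reproduces the $73 figure; the worker schedule in the second call is made up for illustration.

```ruby
# Assumption: $0.10 per instance-hour, the rate that yields about $73 a month.
HOURLY_RATE = 0.10
HOURS_PER_MONTH = 24 * 365 / 12.0   # average month, 730 hours

def monthly_cost(hours, rate = HOURLY_RATE)
  (hours * rate).round(2)
end

# One instance left on all month.
puts monthly_cost(HOURS_PER_MONTH)
# The same, plus 3 extra workers running 2 hours a day for 30 days.
puts monthly_cost(HOURS_PER_MONTH + 3 * 2 * 30)
```

Under that assumed rate, the extra workers add 180 instance-hours, or $18, on top of the always-on server.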

I have a few concerns about things that I'm pretty sure I already know the answer to. First, I'm worried about the storage on the actual EC2 systems. If I'm running a DB server on a node, is the storage on the EC2 node an actual connected hard disk? I know that it's volatile which is the reason for the backups to S3, but I just want to make sure that the DB calls to the file system aren't running over some freakish NFS. Second is the issue of how to point my DNS record to an EC2 system. What if my node goes down and I have to restore? From my understanding they are DHCP and I can't be sure that my IP will remain the same. Do I have to use dynamic DNS?

Does anyone out there with experience running apps exclusively from EC2 have advice or suggestions?

Since I got accepted into the beta a few weeks ago I think it's just time to play around with it and see what I can come up with.


July 03, 2007

Figuring Out Where Ruby is Spending My Memory

Yesterday I mentioned that Starfish and ActiveRecord were eating up memory. Today I decided to dig a little deeper and try to find exactly where my memory is being spent.  If I'm going to figure out how to run as many Ruby processes as possible on a single box I need to know the details. So I just wrote a couple of test Ruby scripts that each run a require on some piece of code that I use and then sleep so I can check memory usage.

I'm just using the command

ps v PID

where PID is the process id. Here are the stats using the RSS number, which is basically the real or resident memory size in KB of the process (more info on determining memory usage with ps). I'm listing the figures in MB to make it a little more readable. The requires of each test are listed next to the figures.
  • 1.36 MB - no requires
  • 3.18 MB - require 'open-uri'
  • 4.35 MB - require 'rubygems'
  • 9.23 MB - require 'rubygems'; require 'hpricot'; require 'open-uri'
  • 17.82 MB - require 'rubygems'; require 'feed_tools'
  • 20.43 MB - require 'config/environment' # RAILS_ENV = 'production'
  • 27.15 MB - # all of above except for feed_tools
  • 28.80 MB - all of the above
So the Rails environment is certainly a memory hog, but it's not the only culprit. The unfortunate thing for me is that at a minimum for these processes to be useful, I have to run open-uri, rubygems, and hpricot. That means at least 9.23 MB per process and that's before I figure out how to offload their results (database, file, whatever). Doing that would require me writing a bunch of custom code instead of taking advantage of ActiveRecord and FeedTools' ease of use.
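
To put those figures against the 256 MB VPS, here is a quick back-of-the-envelope sketch. The helper processes_that_fit is hypothetical, and it ignores memory used by the OS, the database, and the web server, so the results are upper bounds.

```ruby
# How many processes of a given size fit in a fixed amount of RAM.
# Ignores everything else running on the box, so treat it as an upper bound.
def processes_that_fit(total_mb, per_process_mb)
  (total_mb / per_process_mb).floor
end

puts processes_that_fit(256, 9.23)    # open-uri + rubygems + hpricot
puts processes_that_fit(256, 28.80)   # everything, including Rails
```

That works out to at most 27 of the minimal processes, or 8 with the full Rails environment loaded.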

All of this is pointing me to the issue that one tiny 256 MB VPS may not be enough for what I want to do...



Starfish and ActiveRecord love memory

Last night I was working on a part of Tahiti that requires some multi-process love. It's one of the pieces that gathers data. Basically just a feed consumer. The trick is that I want to be able to eat up a bunch of feeds in a short amount of time which is where multi-process comes in. I could just write some multi-threaded stuff, but I want to be able to scale to multiple processors and multiple machines.

I've never tried this stuff in Ruby so I started looking for options. Here's the quick rundown of buzzword names that I found in my search.

  • DRb - Distributed Ruby (Like RMI or CORBA)
  • Rinda - Ruby implementation of Linda (part of DRb for services and clients to auto-discover each other)
  • BackgrounDRb - job server and scheduler for running long running tasks (primarily for use with Rails)
  • AP4R - asynchronous processing for Ruby
  • Starfish - map reduce for Ruby

DRb and Rinda look to be the basic building blocks for the latter three services. BackgrounDRb looked interesting although I'm not sure it's what I need. I'm not kicking off this thing from a Rails controller action, and I don't need to feed info back in to the front end. This is purely a background task that needs to run all the time. AP4R also looks interesting, but it seems most of the documentation is in Japanese.

After a quick look at Starfish I decided that it was what I would attempt to use to accomplish this task. It's advertised as a simple Ruby version of map reduce by its creator Lucas Carlson. I like how simple the interface is and it would be perfect for my needs if it weren't for one huge problem. The memory usage of the processes it kicks off is huge (something like 30 MB)! I assume this is because of ActiveRecord. Unfortunately this eliminates Starfish/ActiveRecord as an option since the VPS I'm setting the prototype up on has only 256 MB of RAM.

I have a few other observations on Starfish. It's really only the map part of map-reduce (which is still quite useful). Also, I don't quite understand how to start up multiple clients with it. The documentation says to just run:

starfish my_awesome_map_reduce.rb

I guess that starts the whole thing and then running that again will kick off additional clients. I need to dig into the starfish code to more closely understand things, but that didn't seem like a decent way to kick off multiple clients. The issue I had is that if you're kicking them off using a rake task, how can you be sure to start up the additional clients after the server has successfully started? Here's the super lame thing I did:

namespace :crawl do
  desc "Kicks off Starfish 3 times for feed_updater.rb"
  task :feed_updater => :environment do
    method(:fork).call { system("starfish lib/feed_updater.rb") }
    sleep(30)
    2.times do
      method(:fork).call { system("starfish lib/feed_updater.rb") }
    end
  end
end

Of course that was just to test things and now that I know these processes are whores for memory, the whole strategy is out anyway. I am curious how other people using Starfish kick off the additional clients.
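
One alternative to the fixed sleep(30) is to poll until the server accepts connections. This is only a sketch: it assumes the server ends up listening on a known TCP port, which is not confirmed for Starfish here, and wait_for_port is a hypothetical helper. A throwaway TCPServer stands in for the real server.

```ruby
require 'socket'

# Poll until something accepts TCP connections on host:port, or give up.
# Returns true once a connection succeeds, false after the timeout.
def wait_for_port(host, port, timeout = 30, interval = 0.5)
  deadline = Time.now + timeout
  loop do
    begin
      TCPSocket.new(host, port).close
      return true
    rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH
      return false if Time.now >= deadline
      sleep(interval)
    end
  end
end

# Stand-in for the server process: any listening socket will do.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]
puts wait_for_port("127.0.0.1", port, 2)
server.close
```

In the rake task, the additional clients would be forked only after wait_for_port returns true.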

So now I have to look for something else to accomplish this task. I'll probably end up using DRb and Rinda, but write my own custom code for the other stuff and skip using ActiveRecord. So much for easy distributed computing abstraction.

As a final aside, Lucas mentions how distributing the task of processing a log file slowed things down by 20x because of the problem of communicating and sending the data over the wire. I assume MapReduce would have this problem too if it weren't for the Google File System (GFS). One of the things the paper mentions is the importance of taking advantage of locality of data. Meaning that calls to process certain chunks of the input are put to machines that have a cached copy or are in the same rack as the machine that holds the data. The GFS paper also mentions data redundancy across the cluster which is something I would think might help MapReduce run faster.

Now it's back to the drawing board. If you have any suggestions or pointers to good reading materials about distributed computing in Ruby I would greatly appreciate it.


July 01, 2007

Giving My Project a Code Name

There's a little project I've been meaning to start work on for quite some time now. I had the idea last December and I've just been kicking it around and trying to learn a few things that might help out. The machine learning stuff is part of that effort.

I've decided that the machine learning code can't get better until I apply it to my own problem. So I've decided to start the project. I'll be blogging about that for a little while and to help make it more real I've chosen a code name: Tahiti.

Why Tahiti?

Well I wanted the name to help keep me motivated. I have a history of picking up projects for a little bit, getting bored with them and then unceremoniously discarding them. The motivation for the project is to not only make something cool and useful, but to hopefully make something that can contribute to my bottom line (read: help pay off my student loans). Since I thought "Underpass" (as in that's where you'll be living if you don't figure out a way to make a few bucks) was a little too negative I decided to go with a vacation destination (as in succeed with this and that's where you'll be able to go).

I have the basic ideas behind version 1 and now I just need to get cracking on the code. I'm starting this as a solo project since it's easier for me to prototype this way. Now if I can just stay motivated...

Tahiti...

Deployed Rails Stack on Slicehost

I picked up a Slicehost account a few months ago with the intention of launching a site some time this summer. I've been super busy with Google and trying to gemify my machine learning code so I haven't had a chance to start using it yet. Well the gem is mostly done and I'm awaiting a little code review from Trotter so I finally had a little time to get things set up.

It turned out to be easy as hell thanks to capistrano and deprec. Capistrano is the handy ruby library for automating deployment tasks and deprec is just a bunch of Capistrano recipes to install the full server stack. I won't cover the details since they're documented quite well in this deprec post by Craig Ambrose and this PeepCode screencast by Top Funky and this post specific to deprec on slicehost.

Thanks guys, anybody who makes installing a full server stack that painless is on my list of heroes.

