I've been looking into spreading out some batch processing and background jobs over a group of machines. The trick is that I don't want to deal with the dirty details of writing multiprocess and multithreaded stuff more than once. I want to write an abstraction like MapReduce so I can then easily write code that runs distributed, but looks deceptively simple. The other catch is that I'm designing it to run on EC2.
I started out last week looking at Starfish as an option, but it didn't
quite meet my needs. It uses Rinda, which relies on UDP broadcasts. I'm
not even sure that will work in EC2 and in general it just seems like
too much of a chatty Kathy. So I decided to look at DRb and figure out
how to write my own stuff. Here's the outline of how I want it to work from a high level.
- I write a basic MapReduce task (my design isn't exactly MR yet)
- Run some cap deploy from my dev box that kicks off the following
- Tells the main web server (running a Master) to start the number of EC2 nodes it determines it needs
- Each EC2 node comes up and syncs the latest code with a "Worker" process waiting to be called
- The Master sends the information on different input chunks out to each EC2 node through the Worker
- The Worker kicks off a separate process for each chunk (each machine should be able to process multiple chunks at once)
- The Chunk process gets the input chunk (in this case from the database) and splits the chunk into key/value pairs and threads out calls to map with each one
- The threaded map calls return and the Chunk process notifies the Master (repeat for all chunks)
- The Master turns down each EC2 node as it finishes processing all Chunks
First is the MapReduce. When I write one it must have the following:
- A class method determine how many chunks there are (needs AR access)
- A class method to get each chunk and split into its key/value pairs (needs AR access and the outputted key/value pairs shouldn't be AR objects)
- A map method
- An attr_accessor for an output object that has an emit(key, value) method
- A class method that takes a key/value pair from an emit and can use them to write data (needs AR access)
The Worker is just a DRb process that listens for calls from the Master and kicks off new Chunk processes when instructed. The Master is the process that orchestrates the whole thing and will eventually be responsible for making sure each chunk gets processed. It is also a DRb listener since it has to receive updates from the Worker on each chunk.
I still have to work out some of the details since I've been testing out different things over the last week. Once I have the whole mess running on my development box I'll start working on getting it to work on EC2 (the real fun part).
Someone may tell me I should be using BackgroundDRb, but I'm not sure it will fit my needs either. I'm not looking at running just a few long running background tasks. I'm looking at running a bunch of stuff in parallel and with changing code and completely disconnected from the web server. Although at this point I'm so delirious that this may just be a really stupid design and overly complicated system.
Technorati Tags: ruby drb ec2 activerecord multithreading
Comments