Project Tahiti (recently dubbed Filterly) is a bit of a resource hog. Actually, it's worse than that: a few things make it more resource-heavy than a typical web application. From a very high level, my needs look something like this:
- scalable data store
- web application layer
- crawling and feed parsing workers
- machine learning workers
A typical web application usually only has to worry about the first two items (and probably some basic background processing). On some apps the data store doesn't even need to be that scalable as long as you're using caching intelligently. With Filterly, the background workers can get quite expensive in terms of machine resources and stress on the data store.
In an ideal world I'd have a Google cluster handy, but I'm an impoverished college student, so I have to make do with limited resources. Very limited. Here's my strategy for using EC2 to minimize my costs. Currently, my configuration looks something like this:
- 1 instance running a customized EC2 on Rails which includes database, web application, a beanstalkd queue, and a custom daemon written by me.
Total base cost: $73 per month plus a few bucks for some S3 backups. For now I'm just using the base EC2onRails backup strategy, which is a full backup every night with incremental backups every 10 minutes all going to S3. I'm planning on moving the database to flexible storage once it's out to the general public, but for now this will do. I'm not storing banking transactions.
The cool and thrifty part is the daemon. Every two hours it loads up the beanstalkd queue with jobs and starts up a custom EC2 instance based on a basic Ubuntu AMI. The instance then downloads the latest Filterly code from S3 and runs a script. The script starts up a bunch of worker processes that pull jobs off the beanstalkd queue. The daemon monitors the queue until it's empty, then shuts down the EC2 instance. I could have used SQS, but I like beanstalkd more. It's quicker than SQS and still feature-rich, with things like priorities, delayed jobs, and reservations. Most importantly: it's free, since it's running on an instance I'm already paying for.
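Here's a minimal sketch of that cycle in Python. It is not the actual daemon, just the shape of the loop. The `JobQueue` and `WorkerHost` interfaces, the `run_cycle` name, and the polling interval are all hypothetical stand-ins. A real version would back them with a beanstalkd client and the EC2 API.

```python
import time
from typing import Callable, Iterable, Protocol


class JobQueue(Protocol):
    """Hypothetical wrapper around the beanstalkd queue."""

    def put(self, job: str) -> None: ...

    def pending(self) -> int:
        """Jobs not yet finished (should count ready AND reserved jobs)."""
        ...


class WorkerHost(Protocol):
    """Hypothetical wrapper around starting and stopping the worker instance."""

    def start(self) -> None: ...

    def stop(self) -> None: ...


def run_cycle(
    queue: JobQueue,
    host: WorkerHost,
    jobs: Iterable[str],
    poll_seconds: float = 30.0,          # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Load the queue, boot the worker, wait for the queue to drain, shut down.

    Returns the number of jobs that were enqueued.
    """
    enqueued = 0
    for job in jobs:
        queue.put(job)
        enqueued += 1

    if enqueued == 0:
        return 0  # nothing to do, so don't pay for an instance-hour

    host.start()
    try:
        # Poll until every job has been taken and finished by a worker.
        while queue.pending() > 0:
            sleep(poll_seconds)
    finally:
        # Always shut the instance down, even if monitoring fails,
        # so a bug in the daemon doesn't leave a paid instance running.
        host.stop()
    return enqueued
```

One detail worth getting right in a real version: `pending()` should count jobs that workers have reserved but not yet finished, not only jobs waiting in the queue. Otherwise the instance gets shut down while the last few jobs are still running.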
I have this run about six times a day. Each run uses one instance-hour at 10 cents, so it adds 60 cents a day, or roughly $18 per month, to my total cost. This is perfect for my current stage of development (before private beta). Once I launch into private beta, I'll probably keep a worker instance running constantly to update the most trafficked topics and feeds.
The real value comes when I scale up the private beta and the index gets larger. It's a simple change in the daemon to make it launch 10 instances for crawling, updating, and machine learning for just an hour, six times a day. So all of the topics in Filterly will get six updates per day for around $180 per month. Obviously, I'd like them updating constantly, but I think this is enough to keep it interesting, and it's still affordable. The 10-instance number is completely arbitrary at this point, since I'm not far enough along to be sure how many resources these things will consume, but the idea is the same whether it's 2 or 200 instances.
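The arithmetic behind these numbers can be sketched in a few lines of Python. The 10-cents-per-instance-hour rate and the 730-hour month are assumptions inferred from the $73 base figure, and the function names are made up for illustration. The sketch also assumes that a partial hour is billed as a full one.

```python
import math

RATE_CENTS_PER_HOUR = 10  # assumption: inferred from $73/month for one instance
HOURS_PER_MONTH = 730     # assumption: average month (8760 hours / 12)
DAYS_PER_MONTH = 30       # assumption: round number used for the estimates


def always_on_dollars(instances: int = 1) -> float:
    """Monthly cost of instances that never shut down."""
    return instances * HOURS_PER_MONTH * RATE_CENTS_PER_HOUR / 100


def burst_dollars(instances: int, hours_per_run: float, runs_per_day: int) -> float:
    """Monthly cost of instances started only for scheduled runs."""
    billed_hours = math.ceil(hours_per_run)  # assumed: partial hours round up
    cents = (instances * billed_hours * runs_per_day
             * DAYS_PER_MONTH * RATE_CENTS_PER_HOUR)
    return cents / 100


if __name__ == "__main__":
    print(always_on_dollars())      # base instance: 73.0
    print(burst_dollars(1, 1, 6))   # one worker, six runs a day: 18.0
    print(burst_dollars(10, 1, 6))  # ten workers, six runs a day: 180.0
```

For comparison, under the same assumptions, keeping those 10 workers running around the clock would come to `always_on_dollars(10)`, or $730 per month.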
Finally, this model hasn't addressed the scalable data store problem. I'll probably move over to HBase, CouchDB, or ThruDB later on, but in prototype mode I just need to get something up quickly.