One of the things I need to do in Filterly is keep many trained classifiers. These are the machine learning models that determine if a blog post is on topic (Filterly separates information by topic). At the very least I need one per topic in the system. If I want to do something like boosting then I need even more. The issue I'm wrestling with is how to store this data. I'll outline a specific approach and what the storage needs are.
Let's say I go with boosting and 10 perceptrons. I'll also limit my feature space to the 10,000 most statistically significant features. So the storage for each perceptron is a 10k-element array of weights. However, I'll also have to keep another data structure that records what the 10k features are and their positions in the array. In code I use a hash for this, where the feature name is the key and the value is its position. All 10 perceptrons for a topic share the same features, so I just need to store one of these hashes per topic.
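To make that layout concrete, here is a minimal sketch in Python (not Filterly's actual code). The tiny three-feature index, the weights, the per-perceptron vote weights in `alphas`, and the zero thresholds are all made-up assumptions for illustration; a real topic would have 10 perceptrons and 10,000 features.

```python
from typing import Dict, List


def score(weights: List[float], feature_index: Dict[str, int], tokens: List[str]) -> float:
    """Sum one perceptron's weights for the known features present in a post."""
    total = 0.0
    for token in set(tokens):  # treat features as present/absent, not counted
        position = feature_index.get(token)
        if position is not None:  # tokens outside the feature set are ignored
            total += weights[position]
    return total


def on_topic(
    perceptrons: List[List[float]],
    alphas: List[float],
    feature_index: Dict[str, int],
    tokens: List[str],
) -> bool:
    """Weighted vote across perceptrons (hypothetical boosting-style combination)."""
    vote = 0.0
    for weights, alpha in zip(perceptrons, alphas):
        vote += alpha if score(weights, feature_index, tokens) > 0 else -alpha
    return vote > 0


# One hash per topic: feature name -> position in every perceptron's array.
feature_index = {"ruby": 0, "rails": 1, "soccer": 2}

# One weight array per perceptron, all indexed by the same hash.
perceptrons = [
    [1.0, 0.5, -2.0],
    [0.5, 1.0, -1.0],
]
alphas = [0.6, 0.4]  # assumed vote weights, one per perceptron

print(on_topic(perceptrons, alphas, feature_index, ["ruby", "rails", "deploy"]))
```

The point of the shared hash is that looking up a feature's position is the only step that touches feature names; the perceptrons themselves stay plain numeric arrays.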
That's not really a huge amount of data: 10 perceptrons of 10,000 weights each is 100,000 numbers per topic, plus one 10,000-entry hash. I'm more concerned about the best way to store it. I don't think this kind of thing maps well to a relational database. I don't need to store the features individually. Generally when I'm running the thing I'll want the whole perceptron and feature set in memory for quick access. For now I'm just using a big text field and serializing each one as JSON.
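For reference, the serialize-into-a-text-field approach looks roughly like the following Python sketch. The key names `feature_index` and `perceptrons`, the placeholder feature names, and the zeroed weights are assumptions for illustration only; the database read and write are left out, and `blob` stands in for the contents of the text column.

```python
import json
from typing import Dict, List, Tuple


def dump_classifier(feature_index: Dict[str, int], perceptrons: List[List[float]]) -> str:
    """Serialize one topic's classifier to a single JSON string."""
    return json.dumps({"feature_index": feature_index, "perceptrons": perceptrons})


def load_classifier(blob: str) -> Tuple[Dict[str, int], List[List[float]]]:
    """Parse the JSON string back into the hash and the weight arrays."""
    data = json.loads(blob)
    return data["feature_index"], data["perceptrons"]


# Placeholder data at the sizes discussed above: 10,000 features, 10 perceptrons.
feature_index = {f"feature_{i}": i for i in range(10_000)}
perceptrons = [[0.0] * 10_000 for _ in range(10)]

blob = dump_classifier(feature_index, perceptrons)  # what would go in the text field
restored_index, restored_perceptrons = load_classifier(blob)

print(len(blob), "characters of JSON for one topic")
```

Every classification that starts from the stored form has to run the equivalent of `load_classifier` first, which is where the parsing cost comes from.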
I don't really like this approach. The whole serializing-into-the-database thing seems really inelegant, and on top of that there's the time it takes to parse these things. Each time I want to see if a new post is on topic I'd need to load up the classifier and parse the ten 10k-element arrays and the 10k-key hash. I could keep each classifier running as a service, but then I've got a pretty heavy process running for each topic.
I guess I'll just use the stupid easy solution for the time being and worry about performance later. Anyone have thoughts on the best approach?
Hi Paul,
My master's project is something similar to Filterly. I'll be working with hierarchical text classification.
I'm about to start implementing the classifiers and I'm thinking of forking Basset or maybe extending it.
I don't know a better approach than yours. I'll probably start with YAML/text files and then rethink a database model.
Posted by: Hugo | August 07, 2008 at 05:32 PM
Hey Hugo,
Yeah, I haven't touched Basset in a while. I guess your interest is a signal that I should finally move it to GitHub and update it with some new algorithms. Guess I should get on that ASAP.
Posted by: Paul Dix | August 07, 2008 at 06:16 PM
Hi, great post. I have been working with the Classifier gem, and it has been working great from the command line. However, I want to use it in a Rails app, and because Rails overrides SUM I cannot get it to work properly. Can you suggest something to get this working in Rails, or any other classifier?
Thanks! Keep it up, I enjoy reading your posts.
Marco
Posted by: Marco Kotrotsos | August 08, 2008 at 11:40 AM