One of the things I need to do in Filterly is keep many trained classifiers. These are the machine learning models that determine whether a blog post is on topic (Filterly separates information by topic). At the very least I need one per topic in the system, and if I want to do something like boosting I'll need even more. The issue I'm wrestling with is how to store this data, so I'll outline a specific approach and what its storage needs are.
Let's say I go with boosting and 10 perceptrons, and I limit my feature space to the 10,000 most statistically significant features. The storage for each perceptron is then a 10k-element array of weights. However, I also have to keep another data structure recording what the 10k features are and where each one sits in the array. In code I use a hash for this, with the feature name as the key and its array position as the value. Since all 10 perceptrons for a topic share the same feature space, I only need to store one of these hashes per topic.
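To make that concrete, here's a minimal Python sketch of the shape of the data. The feature names, counts, and the scoring function are placeholders for illustration, not Filterly's actual code:

```python
NUM_PERCEPTRONS = 10
NUM_FEATURES = 10_000

# One hash per topic: feature name -> position in the weight arrays.
# The real entries come from picking the most significant features at training time.
feature_index = {
    "python": 0,
    "machine-learning": 1,
    # ... up to 10,000 entries
}

# One 10k-element weight array per perceptron, ten perceptrons per topic.
weights = [[0.0] * NUM_FEATURES for _ in range(NUM_PERCEPTRONS)]

def score(post_features, perceptron):
    """Dot a post's feature values against one perceptron's weights."""
    total = 0.0
    for name, value in post_features.items():
        pos = feature_index.get(name)
        if pos is not None:
            total += weights[perceptron][pos] * value
    return total
```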
That's not really a huge amount of data (on the order of a megabyte per topic, assuming double-precision weights). I'm more concerned about the best way to store it. I don't think this kind of thing maps well to a relational database, since I don't need to store the features individually, and when the system is running I'll want the whole perceptron and feature set in memory for quick access. For now I'm just using a big text field and serializing each classifier to JSON.
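Roughly, the serialization looks like this; the helper names are made up, and in reality the resulting string just goes into a text column:

```python
import json

def serialize_classifier(feature_index, weights):
    """Pack the feature hash and the ten weight arrays into one JSON string
    that can live in a big text column."""
    return json.dumps({"features": feature_index, "weights": weights})

def deserialize_classifier(blob):
    """Parse the text column back into the in-memory structures."""
    data = json.loads(blob)
    return data["features"], data["weights"]
```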
I don't really like this approach. Serializing everything into the database feels inelegant, and on top of that there's the time it takes to parse these things: each time I want to check whether a new post is on topic, I'd need to load up the classifier and parse the ten 10k-element arrays and the 10k-key hash. I could keep each classifier running as a service, but then I've got a pretty heavy process running for each topic.
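One middle ground between re-parsing on every post and running a separate service per topic would be a simple in-process cache, so each topic's classifier gets parsed once and then reused. A sketch, where fetch_blob stands in for whatever pulls the text field out of the database:

```python
import json

_classifier_cache = {}  # topic id -> (feature_index, weights)

def get_classifier(topic_id, fetch_blob):
    """Parse a topic's JSON blob the first time it's needed and keep it
    in memory, so checking later posts skips the parsing cost."""
    if topic_id not in _classifier_cache:
        data = json.loads(fetch_blob(topic_id))
        _classifier_cache[topic_id] = (data["features"], data["weights"])
    return _classifier_cache[topic_id]
```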
I guess I'll just use the stupid easy solution for the time being and worry about performance later. Anyone have thoughts on the best approach?