For a while now I've been wanting to create a Ruby library for parsing feeds (Atom, RSS). I know there are already some out there, but for one reason or another each of them falls short of my needs. The first part of this effort requires me to decide how I want to go about parsing the feeds. One of the primary goals is speed. I want this thing to be fast. So I decided to look into SAX parsing. Further, I wanted to use Nokogiri because I'm going to be doing a few other things with the posts, like pulling out links and tags and sanitizing entries.
SAX parsing is an event-based parsing model. Instead of building a tree of the whole document, the parser fires events as it reads: start_element, end_element, start_document, end_document, and characters. Nokogiri contains a very lightweight wrapper around libxml's SAX parser. Using this I started creating a basic declarative syntax for registering parsing events and mapping them into objects and values. The end result is SAX Machine, a declarative SAX parser for parsing XML into Ruby objects.
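To make the event model and the declarative idea concrete, here is a minimal sketch. It is not SAX Machine's implementation or API: the `TinySaxMapper` module, its `element` macro, and the `Entry` class are hypothetical names made up for illustration, and it uses REXML's stream parser (whose callbacks are named `tag_start`, `tag_end`, and `text`) rather than Nokogiri so that it runs with only the libraries bundled with Ruby.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical mini-DSL for illustration only; not SAX Machine's API.
module TinySaxMapper
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Tag name (String) => attribute name (Symbol) for this class.
    def mappings
      @mappings ||= {}
    end

    # Declare that the text inside <tag> is stored in the attribute `as`.
    def element(tag, as: tag)
      mappings[tag.to_s] = as.to_sym
      attr_accessor as
    end

    # Build an instance and fill it in a single streaming pass over the XML.
    def parse(xml)
      object = new
      REXML::Document.parse_stream(xml, Handler.new(object, mappings))
      object
    end
  end

  # Receives the SAX-style callbacks from REXML's stream parser.
  class Handler
    include REXML::StreamListener

    def initialize(object, mappings)
      @object = object
      @mappings = mappings
      @current = nil # attribute currently being collected, if any
      @buffer = +""
    end

    # Opening tag: start collecting text if the tag was declared.
    def tag_start(name, _attrs)
      return unless @mappings.key?(name)
      @current = @mappings[name]
      @buffer = +""
    end

    # Character data: only kept while inside a declared tag.
    def text(data)
      @buffer << data if @current
    end

    # Closing tag: store the collected text, keeping the first match only.
    def tag_end(name)
      return unless @current && @mappings[name] == @current
      if @object.public_send(@current).nil?
        @object.public_send("#{@current}=", @buffer.strip)
      end
      @current = nil
    end
  end
end

class Entry
  include TinySaxMapper
  element :title
  element :name, as: :author
end

xml = <<~XML
  <entry>
    <title>Hello SAX</title>
    <author><name>Paul</name></author>
  </entry>
XML

entry = Entry.parse(xml)
puts entry.title
puts entry.author
```

The class body only declares which tags matter; the handler does the bookkeeping of which tag is open and where its text should go. Nothing is kept in memory beyond the current text buffer and the object being filled, which is where the speed of this style of parsing comes from.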
Some of the syntax was inspired by John Nunemaker's HappyMapper. However, the behavior is a bit different. There are probably many more features I am going to add as I use it to create my feed parsing library, but the basic stuff is there right now.
Here's an example that uses SAX Machine to parse a FeedBurner Atom feed into Ruby objects.
So in a few short lines of code this parses FeedBurner feeds exactly how I want them. It's also pretty fast. Here's a benchmark against rfeedparser, parsing the Atom feed from this blog 100 times.
                  user     system      total        real
feedzirra     1.430000   0.020000   1.450000 (  1.472081)
rfeedparser  15.780000   0.580000  16.360000 ( 16.768889)
It's about 11 times faster in this test. I have to do more tests before I feel comfortable with that figure, but it's a promising start. More importantly this will enable me to make a feed parser that's flexible, extensible, and readable.
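The figures above look like the output of Ruby's standard Benchmark module: user CPU time, system CPU time, their total, and elapsed real time in parentheses. A minimal sketch of that kind of harness follows. The two lambdas are placeholder workloads standing in for the two parsers (they are assumptions, not the code that was actually measured), and `speedup` is a hypothetical helper that just divides the two totals quoted above.

```ruby
require "benchmark"

# Placeholder workloads standing in for the two feed parsers. In a real run
# each lambda would parse the same Atom feed with one of the libraries.
fast_parse = -> { 1_000.times { |i| i * 2 } }
slow_parse = -> { 10_000.times { |i| i * 2 } }

runs = 100

# Prints one row per report: user, system, total, and (real) time in seconds.
Benchmark.bm(12) do |bm|
  bm.report("fast parser") { runs.times { fast_parse.call } }
  bm.report("slow parser") { runs.times { slow_parse.call } }
end

# Hypothetical helper: how many times faster the quicker run was.
def speedup(slow_total, fast_total)
  slow_total / fast_total
end

# Totals taken from the figures quoted above.
ratio = speedup(16.36, 1.45)
puts format("speedup: %.1fx", ratio)
```

Dividing the two totals gives roughly 11.3, which is where the "about 11 times" figure comes from.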
SAX Machine is available as a gem as long as you have GitHub added as a gem source. Just do gem install pauldix-sax-machine to get the party started (UPDATE: for some reason it's not built yet. Looking into it. UPDATE 2: it's built and ready to go, thanks to nakajima). Also, special thanks to Bryan Helmkamp for helping me turn my initial spike into a decently tested and refactored version (even if it did slow my benchmark down by a factor of 4).
Very nice! I'm looking forward to seeing how this evolves!
Cheers,
John
Posted by: johnpg | January 17, 2009 at 11:59 AM
Great work! Are you gonna publish Feedzirra soon as currently the GitHub repo is empty?
Posted by: Nikolay Kolev | January 21, 2009 at 08:22 AM
Hi Nikolay,
Right now I'm playing around with the SAXMachine stuff a bit. I'll be working on Feedzirra after that. Hopefully I'll push an initial release to github in 2-3 weeks.
Posted by: Paul Dix | January 21, 2009 at 03:21 PM
Great! Can't wait to play with it! :-)
Posted by: Nikolay Kolev | January 21, 2009 at 11:16 PM