For a long time I have been disappointed in the Atom and RSS libraries available in Ruby. I always had some small issue like it wouldn't normalize the way I wanted it to, it wouldn't grab feeds the way I wanted it to, or it was just plain slow. So I set about creating yet another Ruby feed parsing library. The result of this effort is Feedzirra: a feed (Atom and RSS) fetching and parsing library for Ruby.
One of the primary goals of Feedzirra is speed. I wanted it to be able to fetch and parse many feeds quickly. Examples with libraries that show getting a single feed are great, but I'm going to be using it to get thousands. With that in mind I spent time looking for the fastest way to parse feeds (nokogiri and libxml) and the fastest way to get feeds (taf2-curb and libcurl). I also built in logic for updating feeds that speeds things up even more, saves on bandwidth costs, and makes it dead simple to see what's new in an updated feed.
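To give a rough feel for the update idea, here is a minimal sketch in plain Ruby. It is not Feedzirra's actual code: the Entry struct and the conditional_headers and new_entries helpers are hypothetical names, and it assumes the server honors the standard If-None-Match and If-Modified-Since request headers.

```ruby
require "time"

# Hypothetical entry type for illustration; not Feedzirra's own class.
Entry = Struct.new(:url, :title, :published)

# Build conditional GET headers from what the previous fetch returned.
# A server that honors them can answer 304 Not Modified with an empty body,
# which is where the bandwidth saving comes from.
def conditional_headers(etag: nil, last_modified: nil)
  headers = {}
  headers["If-None-Match"] = etag if etag
  headers["If-Modified-Since"] = last_modified.httpdate if last_modified
  headers
end

# Entries in the fresh copy whose URLs were not present in the old copy.
def new_entries(old_entries, fresh_entries)
  seen = old_entries.map(&:url)
  fresh_entries.reject { |entry| seen.include?(entry.url) }
end

old = [Entry.new("http://example.com/1", "First", Time.utc(2009, 1, 30))]
fresh = [Entry.new("http://example.com/2", "Second", Time.utc(2009, 2, 2))] + old

headers = conditional_headers(etag: "abc123",
                              last_modified: Time.utc(2009, 1, 31, 22, 58, 16))
puts headers.inspect
puts new_entries(old, fresh).map(&:title).inspect
```

Comparing entries by URL is only one possible choice; a feed that reuses URLs would need a different key.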
A second design consideration was that I wanted the library to be extensible and customizable. This means that you can add custom parsers to Feedzirra to handle different feed types (like microformats, for example). What this also means is that if you find a bug in the parsing on a specific feed, you can write a parser (should take less than 20 lines of code) and use it with Feedzirra while you wait for me or a contributor to get the bug fix in. Feedzirra also allows you to define callback behavior after success or failure of fetching a feed.
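To make the parser and callback ideas concrete, here is a minimal sketch in plain Ruby. It is not Feedzirra's actual API: ParserRegistry, TinyRssParser, TinyAtomParser, process, and the callback signatures are all hypothetical names chosen for illustration, and the fetch result is passed in by hand rather than fetched over the network.

```ruby
# Illustrative sketch only; every name here is hypothetical, not Feedzirra's API.

# A parser is any class that can say whether it handles a document and parse it.
class TinyRssParser
  def self.able_to_parse?(xml)
    xml.include?("<rss")
  end

  def self.parse(xml)
    { :type => :rss, :title => xml[%r{<title>(.*?)</title>}m, 1] }
  end
end

class TinyAtomParser
  def self.able_to_parse?(xml)
    xml.include?("<feed")
  end

  def self.parse(xml)
    { :type => :atom, :title => xml[%r{<title>(.*?)</title>}m, 1] }
  end
end

class ParserRegistry
  def initialize
    @parsers = []
  end

  # Parsers added later are tried first, so a custom one can take over a feed type.
  def add(parser)
    @parsers.unshift(parser)
  end

  # Takes the outcome of a fetch (done elsewhere) and fires one of the callbacks.
  def process(url, response_code, body, on_success: nil, on_failure: nil)
    parser = @parsers.find { |candidate| candidate.able_to_parse?(body) }
    if response_code == 200 && parser
      feed = parser.parse(body)
      on_success.call(feed) if on_success
      feed
    else
      on_failure.call(url, response_code) if on_failure
      nil
    end
  end
end

registry = ParserRegistry.new
registry.add(TinyRssParser)
registry.add(TinyAtomParser)

registry.process("http://example.com/feed.rss", 200,
                 "<rss><channel><title>Example</title></channel></rss>",
                 on_success: lambda { |feed| puts "parsed #{feed[:type]} feed: #{feed[:title]}" },
                 on_failure: lambda { |url, code| puts "#{url} failed with #{code}" })
```

The string matching here stands in for real XML parsing, which is what a parser written against nokogiri would do instead.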
There are a few missing things like support for gzip and deflate encoding and Ruby 1.9.1 testing, but all the basic pieces are there and the others are soon to follow. It's ready for use. For more information and benchmarks against other methods you can look at the readme for Feedzirra on GitHub. If you have comments on the API or if you find bugs on feeds in the wild, please comment in the Feedzirra discussion group. I'm eager to see this thing in use and want to make sure that it's rock solid. Special thanks to Pat Nakajima for looking at the code and helping to refactor a bit of it.
In the meantime, here's a gist to give you an idea of what Feedzirra can do and what the interface looks like.
For Nokogiri:
sudo aptitude install libxml2
sudo aptitude install libxslt-dev
Feedzirra itself:
sudo gem install pauldix-feedzirra
After that I still got several errors, so:
sudo gem install curb
sudo gem install curl-multi
require "rubygems"
require "feedzirra"
feed = Feedzirra::Feed.fetch_and_parse("http://www.tvnzb.com/tvnzb_new.rss")
--> NameError: uninitialized constant Curl::Multi
from /var/lib/gems/1.8/gems/pauldix-feedzirra-0.0.1/lib/feedzirra/feed.rb:57:in `fetch_and_parse'
After that error I thought it might be an OK idea to require "curl-multi":
require "curl-multi"
/var/lib/gems/1.8/gems/curl-multi-0.2/lib/curl-multi.rb: In function 'perform':
/var/lib/gems/1.8/gems/curl-multi-0.2/lib/curl-multi.rb:347: warning: call to '_curl_easy_getinfo_err_string' declared with attribute warning: curl_easy_getinfo expects a pointer to char * for this info
/var/lib/gems/1.8/gems/curl-multi-0.2/lib/curl-multi.rb:350: warning: call to '_curl_easy_getinfo_err_long' declared with attribute warning: curl_easy_getinfo expects a pointer to long for this info
/var/lib/gems/1.8/gems/curl-multi-0.2/lib/curl-multi.rb: In function 'add_to_curl':
/var/lib/gems/1.8/gems/curl-multi-0.2/lib/curl-multi.rb:248: warning: call to '_curl_easy_setopt_err_write_callback' declared with attribute warning: curl_easy_setopt expects a curl_write_callback argument for this option
=> true
But still:
feed = Feedzirra::Feed.fetch_and_parse("http://www.tvnzb.com/tvnzb_new.rss")
NoMethodError: undefined method `on_success' for #
Posted by: Marc | February 03, 2009 at 10:47 AM
it's only a feed reader library? you can't build a feed with it?
Posted by: Elad | February 03, 2009 at 10:47 AM
Marc,
You need to have libcurl installed. Also, it doesn't use curb or curl-multi. You must have the taf2-curb fork of curb. What were the errors thrown when you did a gem install pauldix-feedzirra?
Elad,
It's only a fetcher and parser. Generating feeds depends on what you're generating from. To generate you really only need a few lines of builder code. It's not something I'd even consider using a library for (other than builder or something like that).
Posted by: Paul Dix | February 03, 2009 at 11:27 AM
This looks great! I have a couple sites that need something like this.
Posted by: Daniel Higginbotham | February 03, 2009 at 12:22 PM
Congrats on the release, Paul. Those are some hot benchmarks
Posted by: Bryan Helmkamp | February 03, 2009 at 01:52 PM
looks great, but i can't get it working :
/Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `gem_original_require': no such file to load -- curb_core (LoadError)
from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `require'
from /Library/Ruby/Gems/1.8/gems/taf2-curb-0.2.3/ext/curb.rb:5
from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `gem_original_require'
from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `require'
from /Library/Ruby/Gems/1.8/gems/pauldix-feedzirra-0.0.1/lib/feedzirra.rb:5
from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:36:in `gem_original_require'
from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:36:in `require'
from test_feedzirra.rb:2
something missing in the gem ?
Posted by: julbouln | February 03, 2009 at 05:02 PM
julbouln,
That error is probably due to Mac Ports. I just added a note to the installation instructions on the readme. If you have Mac Ports and you have curl installed through there, you need to remove it. If you're on Leopard then you're ready to go. Otherwise, download the latest curl and build from source.
If you're not using Mac Ports, then my guess is still that you have an older version of curl. Clean it out and get the latest.
Posted by: Paul Dix | February 03, 2009 at 06:05 PM
thanks for the reply
I have mac ports installed, but have the leopard /usr/bin/curl version.
I tested on a linux box and got the same error
usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require': no such file to load -- curb_core (LoadError)
Posted by: julbouln | February 03, 2009 at 06:57 PM
It seems that installing taf2-curb with gem doesn't actually compile the lib.
Doing it manually in the taf2-curb folder
ruby ext/extconf.rb;make;make install
gets everything working fine!
Posted by: julbouln | February 03, 2009 at 07:54 PM
Paul, this looks really promising. Back in July I set out to rewrite hordes of FeedTools to use libxml and generally have a cleaner implementation, but I quickly realized that was going to be a full-time job. Well done!
Posted by: Sean | February 04, 2009 at 12:28 AM
If you want to remove that dependency on curb, which in my experience causes nothing but trouble (as the comments here seem to indicate as well), might I humbly suggest our http client library, Resourceful. http://resourceful.rubyforge.org/
It will handle all the other things for you; the conditional get, redirects, proxies, etc... Take a look, I'm trying to drum up some more interest in it.
Posted by: Paul | February 04, 2009 at 12:46 AM
You, sir, are awesome. I was just about to start writing my own because of the lack of choices out there, but now I'm following you on github. I'm sure I'll contact you again... hopefully with more praises.
Posted by: Josh Kim | February 04, 2009 at 01:20 AM
Hi Paul,
Nice work on the feed library. I'm still working on porting curb to Ruby 1.9.1. Assuming I get some free time between work this week, I'm hoping to make a new release early next week.
-taf2
Posted by: Todd Fisher | February 04, 2009 at 07:06 AM
To everyone reporting issues installing curb, can you send me the results of running ruby ext/extconf.rb, inside of the failed gem build dir... You can send me a message on github.com
thanks,
taf2
Posted by: Todd Fisher | February 04, 2009 at 07:10 AM
Hi Other Paul,
I would only remove the dependency on curb if I found something that was easier to install yet still kept the speed. I took a look at the Resourceful source on github and I see that it's using net/http. That's a deal breaker for me since I've written about net/http being too slow for my needs. Blocking IO and no keep-alive ruin the performance of any library built on top of it.
Hi Todd,
That's awesome that you're working on 1.9.1 compatibility. What about the issue of the gem not compiling on gem install? Is that some other issue on people's machines or just a quick fix to the gemspec? I'll put a note in the installation instructions to also let you know about curb problems. Let me know if I can help in any way.
Posted by: Paul Dix | February 04, 2009 at 10:37 AM
Great stuff, I'm really excited! Only problem, I can't install it :( I get as far as installing curb:
$ sudo gem install taf2-curb
...
Makefile:137: warning: overriding commands for target `/usr/lib/ruby/gems/1.8/gems/taf2-curb-0.2.4/ext'
Makefile:135: warning: ignoring old commands for target `/usr/lib/ruby/gems/1.8/gems/taf2-curb-0.2.4/ext'
/usr/bin/install -c -m 644 ./curb.rb /usr/lib/ruby/gems/1.8/gems/taf2-curb-0.2.4/ext
/usr/bin/install: `./curb.rb' and `/usr/lib/ruby/gems/1.8/gems/taf2-curb-0.2.4/ext/curb.rb' are the same file
make: *** [/usr/lib/ruby/gems/1.8/gems/taf2-curb-0.2.4/ext/curb.rb] Error 1
Any ideas?
Posted by: Chris | February 04, 2009 at 05:08 PM
Hi Chris,
You can try going into /usr/lib/ruby/gems/1.8/gems/taf2-curb-0.2.4/ext and running make. Otherwise, uninstall that gem and do a git clone git://github.com/taf2/curb.git, then run rake gem in the curb directory, then sudo gem install pkg/curb-0.2.4.0.gem
Please let me know if that works.
Posted by: Paul Dix | February 04, 2009 at 06:33 PM
I successfully installed taf2-curb 0.2.6.1 by cloning and building locally. It doesn't help much though, since the feedzirra installation still fails when trying to build taf2-curb 0.2.4. Does it depend on 0.2.4 specifically?
Posted by: Chris | February 05, 2009 at 05:12 PM
Hi Chris,
I'm looking at the taf2-curb gemspec and it says version 0.2.4. Even so, the Feedzirra gemspec requires taf2-curb >= 0.2.4 so I would expect higher versions to work. Can you paste in the exact error?
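As a quick sanity check on the version constraint, RubyGems can evaluate it directly. This is only a sketch of how a ">= 0.2.4" requirement treats a few version strings; 0.2.3 is included purely as a made-up counterexample.

```ruby
require "rubygems"

# The constraint as described above: taf2-curb >= 0.2.4.
requirement = Gem::Requirement.new(">= 0.2.4")

["0.2.3", "0.2.4", "0.2.6.1"].each do |version|
  ok = requirement.satisfied_by?(Gem::Version.new(version))
  puts "taf2-curb #{version}: #{ok ? 'satisfies' : 'does not satisfy'} #{requirement}"
end
```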
Posted by: Paul Dix | February 05, 2009 at 06:39 PM
Hi,
Your library is really great, and it came exactly when I needed it, thanks a lot !
I've just one suggestion. On an atom feed the last_modified attribute gets the value of the lastBuildDate field, whereas there is a pubDate field which seems more appropriate (at least it is on the atom feed I have to parse frequently, because the lastBuildDate field changes every 5 minutes, yet there are no new items since the time referenced by the pubDate field).
Thibaut
Posted by: Thibaut | February 07, 2009 at 09:16 AM
Finally, I've got it installed :) I think the reason it failed the first time I built curb locally was that feedzirra wanted taf2-curb, but when I installed it locally it was called curb only. github gems isn't always so hot.
Will be playing around with feedzirra now, thanks for your help!
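The name mismatch can be reproduced with RubyGems' own dependency matching. A minimal sketch, assuming the dependency is declared as taf2-curb >= 0.2.4 as Paul described:

```ruby
require "rubygems"

# The dependency as feedzirra is said to declare it (an assumption from this thread).
dependency = Gem::Dependency.new("taf2-curb", ">= 0.2.4")

# Same code, same version, but only the gem with the matching name counts.
puts dependency.match?("taf2-curb", "0.2.6.1")
puts dependency.match?("curb", "0.2.6.1")
```

A locally built gem named curb never satisfies a dependency on taf2-curb, no matter how new its version is.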
Posted by: Chris | February 07, 2009 at 04:48 PM
I'm curious about this in the readme:
This thing needs to hammer on many different feeds in the wild. I’m sure there will be bugs. I want to find them and crush them. I didn’t bother using the test suite for feedparser. I wanted to start fresh.
Given that rfeedparser is just a port of feedparser, doesn't it make sense to start with that suite as it's uncovered all of the well-documented nastiness with RSS?
http://diveintomark.org/archives/2004/02/04/incompatible-rss
Posted by: Sean Porter | February 09, 2009 at 01:32 PM
Sean,
I'm actually not opposed to converting those tests to go against Feedzirra. I just didn't want to take the time to convert all those little edge cases. I thought I would get further by just hitting the main cases.
The other thing is that Feedzirra isn't trying to be exhaustive on the elements in each of the feed types it parses. Exactly the opposite, actually. Feedzirra only wants elements that are common to all of the feed types. The funny thing is that even that is a little loose since some feeds claim to be RSS version whatever, but leave out elements like pubDate (I'm looking at you, RubyForge release feed.)
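Roughly, the idea of keeping only the common elements looks like the sketch below. It is not Feedzirra's code: CommonEntry and the two normalize helpers are hypothetical names, and the inputs are plain hashes standing in for element text that has already been pulled out of the XML.

```ruby
require "time"

# Hypothetical common shape shared by every feed type.
CommonEntry = Struct.new(:title, :url, :published)

# RSS items carry an RFC 2822 date in pubDate, when they bother to include one.
def normalize_rss_item(item)
  published = item["pubDate"] ? Time.rfc2822(item["pubDate"]) : nil
  CommonEntry.new(item["title"], item["link"], published)
end

# Atom entries carry an ISO 8601 timestamp in published.
def normalize_atom_entry(entry)
  published = entry["published"] ? Time.iso8601(entry["published"]) : nil
  CommonEntry.new(entry["title"], entry["link"], published)
end

rss = normalize_rss_item("title" => "Release", "link" => "http://example.com/r",
                         "pubDate" => "Tue, 03 Feb 2009 10:47:00 -0500")
atom = normalize_atom_entry("title" => "Post", "link" => "http://example.com/p",
                            "published" => "2009-02-03T15:47:00Z")
bare = normalize_rss_item("title" => "No date", "link" => "http://example.com/n")

puts [rss, atom, bare].map { |entry| entry.published.inspect }
```

A feed that leaves out pubDate simply ends up with a nil published value instead of breaking the parse.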
Ultimately, if someone wants to convert those tests to run against Feedzirra, I'd definitely do a git pull.
Posted by: Paul Dix | February 09, 2009 at 03:19 PM
Hi Thibaut,
Last modified actually gets either the last published entry date or (if available) the last-modified from the response header. Some Atom feeds update the last modified (in the Atom response) when someone posts a new comment. I'm not quite sure how I should handle this. For now, I'm just keeping the single published date and updating the last_modified attribute on the Feed object with the response header. If the server doesn't include it then it's always just the last published entry date.
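In plain Ruby the rule amounts to something like the sketch below. The resolve_last_modified helper is a hypothetical name, not the method Feedzirra uses, and it assumes the header hash is keyed by the exact string "Last-Modified".

```ruby
require "time"

# Prefer the Last-Modified response header; fall back to the newest entry date.
# Assumes header names arrive already normalized to "Last-Modified".
def resolve_last_modified(response_headers, entry_dates)
  header = response_headers["Last-Modified"]
  return Time.httpdate(header) if header
  entry_dates.max
end

entry_dates = [Time.utc(2009, 1, 28), Time.utc(2009, 1, 30)]

with_header = resolve_last_modified({ "Last-Modified" => "Sat, 31 Jan 2009 22:58:16 GMT" },
                                    entry_dates)
without_header = resolve_last_modified({}, entry_dates)

puts with_header.inspect
puts without_header.inspect
```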
Posted by: Paul Dix | February 09, 2009 at 04:26 PM
Paul, thank you for this gem. I was able to get it and all of its dependencies installed after some struggles with a missing package on my local machine.
To people having errors, you might want to review this post: http://ruby.zigzo.com/2009/02/15/feedzirra-installation-errors/
Posted by: A Nobody! | February 15, 2009 at 02:03 PM