Last night at the NYC Ruby hackfest, I got into a discussion about serializing data. Brian mentioned the Marshal library to me, which for some reason had completely escaped my attention until then. He said it was wicked fast, so we decided to run a quick benchmark comparison.
The test data is designed to roughly approximate what my stored classifier data will look like. The methods we decided to benchmark were Marshal, JSON, eval, and YAML. With each one we took the in-memory object, serialized it, and then read it back in. With eval we had to convert the object to Ruby code to serialize it, then run eval against that. Here are the results for 100 iterations on a 10k element array and a hash with 10k key/value pairs, run on my MacBook Pro 2.4 GHz Core 2 Duo:
                    user     system      total        real
array marshal   0.210000   0.010000   0.220000 (  0.220701)
array json      2.180000   0.050000   2.230000 (  2.288489)
array eval      2.090000   0.060000   2.150000 (  2.240443)
array yaml     26.650000   0.350000  27.000000 ( 27.810609)
hash marshal    2.000000   0.050000   2.050000 (  2.114950)
hash json       3.700000   0.060000   3.760000 (  3.881716)
hash eval       5.370000   0.140000   5.510000 (  6.117947)
hash yaml      68.220000   0.870000  69.090000 ( 72.370784)
The order in which I tested them is pretty much the order in which they ranked for speed. Marshal was amazingly fast. JSON and eval came out roughly equal on the array, with eval trailing quite a bit for the hash. YAML was just slow as all hell. A note on the JSON: I used the 1.1.3 library, which uses C to parse. I assume it would be quite a bit slower if I used the pure Ruby implementation. Here's a gist of the benchmark code if you're curious and want to run it yourself.
If you're serializing user data, be super careful about using eval, since it will execute whatever code that data contains. It's probably best to avoid it completely. Finally, just for fun I took YAML out (it was too slow) and ran the benchmark again with 1k iterations:
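For readers who just want the shape of the test, here is a rough sketch of what a round-trip benchmark like this might look like. It is not the code from the gist: the sample data, the sizes, and the ROUND_TRIPS name are all assumptions made for illustration, and the eval case simply uses inspect to produce Ruby code.

```ruby
require 'benchmark'
require 'json'
require 'yaml'

# Hypothetical stand-ins for the classifier data. Sizes are kept small
# so the sketch finishes quickly; raise them to approximate the post.
ELEMENTS   = 1_000
ITERATIONS = 10

array = Array.new(ELEMENTS) { |i| "element #{i}" }
hash  = {}
ELEMENTS.times { |i| hash["key #{i}"] = i }

# Each lambda serializes the object and immediately reads it back in.
ROUND_TRIPS = {
  'marshal' => ->(obj) { Marshal.load(Marshal.dump(obj)) },
  'json'    => ->(obj) { JSON.parse(JSON.generate(obj)) },
  # inspect emits Ruby literal syntax for arrays and hashes of strings
  # and integers. Never eval data you do not fully trust.
  'eval'    => ->(obj) { eval(obj.inspect) },
  'yaml'    => ->(obj) { YAML.load(YAML.dump(obj)) }
}

Benchmark.bm(15) do |bm|
  { 'array' => array, 'hash' => hash }.each do |name, obj|
    ROUND_TRIPS.each do |label, round_trip|
      bm.report("#{name} #{label}") do
        ITERATIONS.times { round_trip.call(obj) }
      end
    end
  end
end
```

Timings from a sketch like this will differ from the numbers above, since they depend on the Ruby version, the library versions, and the hardware.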
                    user     system      total        real
array marshal   2.080000   0.110000   2.190000 (  2.242235)
array json     21.860000   0.500000  22.360000 ( 23.052403)
array eval     20.730000   0.570000  21.300000 ( 21.992454)
hash marshal   19.510000   0.500000  20.010000 ( 20.794111)
hash json      39.770000   0.670000  40.440000 ( 41.689297)
hash eval      51.410000   1.290000  52.700000 ( 54.155711)
Nice post, Paul.
One issue I ran into with Marshaling data into the database this week was related to character encoding. I never totally nailed down the specific problem, but I had to base64 encode the Marshal output in order to safely store it in a TEXT column.
-Bryan
Posted by: Bryan Helmkamp | August 27, 2008 at 10:20 PM
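For anyone hitting the same problem, the Base64 workaround Bryan describes could look something like the sketch below. The sample data and variable names are hypothetical, and the database write itself is left out. It uses Array#pack with the "m0" directive, which produces Base64 without line breaks (the same output as Base64.strict_encode64) and needs no extra require.

```ruby
# Hypothetical sample data; any object Marshal can dump would work.
data = { 'spam' => [1, 2, 3], 'ham' => 'some text' }

# Marshal.dump returns a binary (ASCII-8BIT) string, which can contain
# bytes that a text column may reject or mangle.
binary = Marshal.dump(data)

# Base64 turns those bytes into plain ASCII that is safe to store as text.
encoded = [binary].pack('m0')

# Reverse both steps when reading the value back out.
restored = Marshal.load(encoded.unpack1('m0'))
```

The trade-off is size: Base64 output is roughly a third larger than the raw Marshal bytes.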
I heard that the YAML lib may be removed from Ruby 1.9 as it has no maintainer. Perhaps a new maintainer could start as a rewriter, and code the lib in C?
Posted by: Dr Nic | August 27, 2008 at 11:41 PM
I think the current YAML parser is Syck, which is a Ragel parser written in C. If that's the case I'm not sure how its speed could be improved.
Posted by: Paul Dix | August 28, 2008 at 11:57 AM
What about XML serialization? I don't think it would be faster than Marshal, but I'd like to see it in the results.
Have a nice day and thank you for this great post,
Lukas
Posted by: Lukas Rieder | September 23, 2008 at 10:40 AM
why the lucky stiff wrote Syck and it's been a part of Ruby since 1.8.0.
however, it could definitely be faster! Syck might not be the bottleneck - the Syck page says Syck is hella fast - but I've seen Ruby YAML go slow as hell on my password gem. first thing I did when I read this post was make a note to switch password from YAML to Marshal.
maybe you have to explicitly invoke Syck to get faster YAML. I don't know. very weird that why brags about its speed yet users complain about its not-speed.
Posted by: Giles Bowkett | October 14, 2008 at 01:26 AM