Thoughts on Serialization Formats

When evaluating a serialization format, how easy it is to evolve schemas (adding new fields, etc.) is as important as storage compactness and encoding/decoding performance, but is often overlooked.

This is a very nice comparison on schema evolution for Avro, Protocol Buffers, and Thrift. In short, Avro schema is harder to evolve than Protocol Buffers and Thrift. Reason: Protocol Buffers and Thrift objects are essentially a list of tagged values, where tags carry minimal schema information for a decoder to skip unrecognizable fields. On the other hand, Avro objects are raw values concatenated in a schema-defined order. So to decode an Avro object, you need both the current schema and the schema when it was encoded.

However, not all raw-values formats are inherently hard to evolve their schemas. The trick lies in how the fields are ordered. Cap’n Proto’s design should make it easy to evolve. (I never used Cap’n Proto, but I would imagine this claim is ture.)

I rarely find Python pickle or Ruby marshal useful. If you want to hack something quickly, plain-text formats like JSON or YAML is easier to work with (you may validate your date with a text editor). If you are working on something serious or for production, you probably should not use pickle or marshal. (However, Rails uses marshal for serializing ActiveRecord and session cookie, which in my opinion, is a really bad choice.)

JSON’s indifference between ints and floats are sometimes annoying. But YAML’s unintuitive interpretation of scalar values is even worse (hint: try yaml.load('[0755, Yes]') in Python).

Creative Commons License
This blog by Che-Liang Chiou is licensed under a Creative Commons Attribution 4.0 International License.