Wednesday, April 7, 2010

Serialization

I was going to write up something on serialization, but when I started researching it I ended up falling down a rabbit hole. Nearly everything that could be said (whether right or wrong) has been. I'll just touch on a couple of points.

By abstracting out the durability layer, we can ignore (for the moment) the issues of data integrity and compression. We'll assume that the durability layer will provide a suitable level of assurance that the data returned by a call to read-bytes is exactly the same as the data previously written. Furthermore, we'll assume that the durability layer will compress the data if it has a need to.
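
For concreteness, here is a minimal sketch of the contract I'm assuming from the durability layer. Aside from read-bytes, the names and signatures are placeholders of mine, not ChangeSafe's actual interface:

(defgeneric write-bytes (durability-layer key byte-vector)
  (:documentation
   "Durably store BYTE-VECTOR under KEY.  The layer may compress or
checksum the data internally; the caller neither knows nor cares."))

(defgeneric read-bytes (durability-layer key)
  (:documentation
   "Return a byte vector EQUALP to the one previously written under KEY."))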

For our purposes, we only need to serialize data in order to turn it into bytes that can be durably stored. There is one thing to keep in mind, though. We don't know how long we may end up keeping this data. We don't want future readers of the data to have to make too many assumptions about what it is we serialized. We should explicitly tag every datum we store with metadata that tells us what it is.

Primitive data doesn't need much in the way of metadata. There should be a tag indicating the primitive type and some indication of the version of the serialization format. You might even dispense with the version tag (how many different ways will you be storing integers, anyway?). If you choose a reasonably standard serialization format, it is much more likely that you'll be able to get at your data in the future even if you lose all the information about how it is encoded.
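
To make the metadata concrete, here is one possible header for a serialized integer: a type tag followed by a format-version byte, written in front of the payload. The constant names and the choice of a single version byte are my own illustration, not part of any standard:

;; Illustrative only; the constant names and the single version byte
;; are assumptions, not a recommendation of any particular format.
(defconstant +tag/integer+ (char-code #\#))
(defconstant +integer-format-version+ 1)

(defun write-integer-header (stream)
  "Write the metadata saying 'an integer, in encoding version 1, follows'."
  (write-byte +tag/integer+ stream)
  (write-byte +integer-format-version+ stream))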

Exercise 2: Pick a binary encoding format and implement an encoder and decoder for these Lisp/Scheme data types: integer (union of fixnum and bignum), character (Unicode code point), float (IEEE 754 double at the very least), UUID (RFC 4122), boolean, rational, complex, string (use UTF-8 unless you have a good reason not to). The decoder should be able to take any sequence of bytes generated by the encoder and return an object equivalent to the input (for an appropriate notion of equivalence). Don't do symbols or cons-cells yet.
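
One possible starting point, covering just two of the easy cases. The tag values, function names, and the three-byte code-point encoding are all choices of mine, and the stream must be an octet stream opened with :element-type '(unsigned-byte 8):

(defconstant +tag/boolean+   (char-code #\?))
(defconstant +tag/character+ (char-code #\c))

(defun encode-datum (object stream)
  "Write OBJECT to the binary STREAM with a one-byte type tag in front."
  (etypecase object
    (boolean
     (write-byte +tag/boolean+ stream)
     (write-byte (if object 1 0) stream))
    (character
     ;; Unicode code points fit in 21 bits; store three bytes, big-endian.
     (write-byte +tag/character+ stream)
     (let ((code (char-code object)))
       (write-byte (ldb (byte 8 16) code) stream)
       (write-byte (ldb (byte 8  8) code) stream)
       (write-byte (ldb (byte 8  0) code) stream)))))

(defun decode-datum (stream)
  "Read one tagged datum from the binary STREAM and return it."
  (let ((tag (read-byte stream)))
    (cond ((= tag +tag/boolean+)
           (= (read-byte stream) 1))
          ((= tag +tag/character+)
           (code-char (logior (ash (read-byte stream) 16)
                              (ash (read-byte stream)  8)
                              (read-byte stream))))
          (t (error "Unknown serialization tag: ~D" tag)))))

A round trip through a file opened with :element-type '(unsigned-byte 8) should hand back an eql object; the remaining types are further branches of the etypecase and the cond.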

For ChangeSafe, I used a single-byte tag to encode the primitive serialized data. The tag was chosen by finding a mnemonic ASCII code for each data type:
(defconstant serialization-code/fixnum    (char-code #\#))
(defconstant serialization-code/keyword   (char-code #\:))
(defconstant serialization-code/string    (char-code #\"))

This allowed me to read serialized data into Emacs and have a few visual hints as to what it meant.

Tip: When you set up an enumeration like this, it is a good idea to define code 0 and code 255 explicitly as illegal values. This will catch a whole slew of accidentally uninitialized data.
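
In the same style as the constants above (again, the exact names are mine, not ChangeSafe's), that might look like:

;; 0 and 255 are the byte patterns most often produced by zeroed or
;; clobbered memory, so reserve them as explicitly illegal tags.
(defconstant serialization-code/illegal-zero 0)
(defconstant serialization-code/illegal-ones 255)

(defun check-serialization-code (code)
  "Signal an error if CODE is one of the reserved illegal tag values."
  (when (or (= code serialization-code/illegal-zero)
            (= code serialization-code/illegal-ones))
    (error "Illegal serialization code ~D; probably uninitialized data." code))
  code)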

4 comments:

John Cowan said...

Student A says: "Given that I don't have to care about compression, I make the following choices: My binary format is UTF-8 plain text, my encoder is WRITE, and my decoder is READ — okay, with a filter in front to reject invalid input, if you insist. UUIDs are supported via SRFI 10."

Joe Marshall said...

That'll work. (Question 2a: What are the drawbacks?)

kbob said...

Given that you haven't told us much about the intended use of this storage mechanism, I vote with Student A and John. Binary formats are lost much more quickly than text.

The drawbacks are primarily performance-related, and since the use of the storage isn't known yet, it's premature to optimize.

John Cowan said...

READ is more general, so it costs extra time compared to a dedicated parser (but not much, and it already has lots of extensions I am going to want later). WRITE is also more general, but that just costs code space.

Compression is still in the hands of the durability layer, which can map to binary or whatever it wants.