Pardon an ignorant question:
I understand the in-memory restriction for the moment, but why are binary
properties hard? There is surely some obvious reason, and I'm missing it
entirely.
In that other project I'm not supposed to plug, I am handling binary data
as follows -- I describe this in case it may offer a simple way out.
First, the problem I ran into is that binary data and XML don't get along
without some form of encoding. This may be a different problem than the
one you are encountering.
The good news is that none of the popular binary-to-text encodings
generates output that conflicts with the problematic XML content
characters. That is, none (or at least none that I found) generates '<',
'>', or '&'. So I decided to use such an encoding.
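To make that concrete, here is a quick sanity check one could run (a
Python sketch, not anything from either project): encode every possible
byte value and confirm that neither output alphabet contains the
XML-special characters.

    import base64

    data = bytes(range(256))  # every possible byte value

    for name, encode in (("base16", base64.b16encode),
                         ("base64", base64.b64encode)):
        out = encode(data).decode("ascii")
        # Neither output alphabet contains the XML-special characters.
        assert not set(out) & set("<>&"), name
        print(name, "alphabet:", "".join(sorted(set(out))))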
The question then became: "which encoding?" -- and I go into this
because there is something non-obvious involved in the choice.
The obvious choices are base16 and base64, either with or without line
breaks in the output. I left the line breaks out because I was unclear
about whitespace handling in XML CDATA sections, and because many other
things will break first if large strings don't work correctly. Note,
however, that this means I can't use a line-based diff for transport or
delta encoding. I'm not convinced that was the right call, but in general
one cannot rely on XML encoders not to alter whitespace, so a text-based
diff is an unreliable mechanism for XML-based encodings anyway.
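For illustration, Python's standard library happens to expose both
styles; b64encode emits a single unbroken line, while the MIME-style
encodebytes wraps at 76 characters (a sketch, with the stdlib codecs
standing in for whatever encoder the store would actually use):

    import base64

    data = bytes(range(90))

    print(base64.b64encode(data))    # one unbroken line
    print(base64.encodebytes(data))  # MIME style: newline every 76 chars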
Base64 interacts unfavorably with byte-based compression schemes. In
base64, every 3 input bytes become 4 output characters. The encoding
skews the input byte distribution seen by the compressor in such a way
that it looks more like a letter-pair distribution than a letter
distribution. Worse, the letter-pair distribution is position-sensitive,
because the same input byte encodes differently depending on its offset
within its 3-byte group, so there is even less redundancy for a
byte-at-a-time compressor to work with. Since the distribution of letter
pairs is relatively sparse and has few recurrences, base64 does not
compress as effectively as simple hex encoding when the compressor reads
the input a byte at a time.
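One crude way to see the skew (a hypothetical Python sketch; the input
pattern is made up for illustration): feed both encoders input with heavy
byte-level redundancy and count the distinct symbols and adjacent symbol
pairs in each output. Because the base16 mapping is context-free while
the base64 mapping depends on position, base64 spreads the same
redundancy over more symbols and pairs.

    import base64
    from collections import Counter

    # Input with strong byte-level redundancy: a short pattern, repeated.
    data = b"\x00\x01\x02\x03\xff" * 1000

    for name, encode in (("base16", base64.b16encode),
                         ("base64", base64.b64encode)):
        out = encode(data)
        pairs = Counter(zip(out, out[1:]))
        print(f"{name}: {len(set(out))} distinct symbols, "
              f"{len(pairs)} distinct pairs, {len(out)} output bytes")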
None of this should matter for the zip class of algorithms, which compress
bitstrings. In fact, both encodings should compress more or less
identically, because at the bitstring level the entropy and distribution
of the input have not changed at all.
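A quick way to sanity-check that claim with a zip-class compressor (zlib
here; a sketch using a single random payload, not a real benchmark):

    import base64
    import os
    import zlib

    # Incompressible payload, so the encoding is the only redundancy.
    data = os.urandom(3000)

    for name, buf in (("raw", data),
                      ("base16", base64.b16encode(data)),
                      ("base64", base64.b64encode(data))):
        print(f"{name}: {len(buf)} bytes -> "
              f"{len(zlib.compress(buf, 9))} compressed")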
If I have it right, what this means is that:
1. There is no compression *disadvantage* to using the base16 encoding in
terms of real file sizes in the store or the workspace. [This assumes
that the store contents will at some level be compressed with a zip-class
algorithm.]
2. Choosing the base16 encoding may provide flexibility later. I can't
see why one would want a byte-based compressor, but conversely I see no
reason to penalize one. Also, base16 is slightly easier to decode by hand
in a debugger.
The only issue I really see is that the base16 representation occupies
more space in the interim buffers (i.e. before being converted back to the
original binary form). My own thought is that one should really view the
base16 or base64 form as an intermediate representation -- that is, one
should imagine it as a filter stuck directly in front of the read/write
stream. If things are handled this way, then there is no substantive
in-memory difference in cost either.
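As a sketch of that filter idea (the class name and shape are mine,
purely illustrative): wrap the underlying stream in a reader that decodes
on the fly, so the hex form only ever exists in small transient chunks.

    import base64
    import io

    class Base16DecodingReader:
        """Presents a stream of hex digits as a stream of decoded bytes.

        Decoding happens per-read, so the full base16 text never needs
        to be held in memory alongside the binary form.
        """
        def __init__(self, source):
            self.source = source  # file-like object yielding hex text

        def read(self, n=-1):
            if n < 0:
                return base64.b16decode(self.source.read())
            # Two hex digits per decoded byte.
            return base64.b16decode(self.source.read(2 * n))

    # Usage: the consumer sees plain binary.
    hextext = base64.b16encode(b"hello, world").decode("ascii")
    reader = Base16DecodingReader(io.StringIO(hextext))
    print(reader.read(5))   # b'hello'
    print(reader.read())    # b', world'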
Not sure any of this is helpful to the issue at hand, but I saw no reason
why the thought trail should need to be retraced.
Jonathan
Ben Collins-Sussman wrote:
> For now, because properties are still "toys", I think it's okay to
> list the following restrictions on them:
>
> * they must not be too large; each property name/value pair must
> easily fit into memory
>
> * they must not be binary data.
>
> Later on we'll implement them with vim and vigor, just like
> text-deltas.