> Did you try compressing the ,v files? It seems likely that delta
> encoding and then general compression would do better than compressing
> each version individually.
I hadn't, but I have now, and you are right. In case anyone is interested,
here are the actual numbers. If anyone wants me to look at other options
(within reason) I'll be glad to do so. I'd also be interested to run the
same analysis on other/larger repositories. Since I seem to have the space
at the moment, please let me know if you'd be okay with having me duplicate
a complete repository in order to do the analysis.
In the EROS repository, there are a total of 34260 versions across 9858 RCS
files.
Output of du -sk . at the root of the tree:
  Blocks   Reduced to   Description
   67480     18.9%      gzip -9 of RCS files
  132594     37.3%      RCS files
  179276     50.4%      bzip2 each version
  184288     51.8%      gzip -9 each version
  355548    100%        uncompressed
I started looking at this because I wanted to understand the value of delta
coding both on the wire and in the repository. These numbers may not
generalize to other projects, but here are my initial reactions to them:
Moving from individually compressed versions to uncompressed RCS files saves
about 25% of the space (26% against bzip2, 28% against gzip -9). In some cases
that extra space might be worth paying for. Formats that separate every
version are more robust -- the fewer versions that you stream through memory
in each operation, the fewer versions there are to corrupt.
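For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes the savings from the du -sk figures above. The function name is just an illustration; the block counts are copied straight from the table.

```python
# Sizes in 1K blocks, copied from the du -sk table in this message.
sizes = {
    "gzip -9 of RCS files": 67480,
    "RCS files": 132594,
    "bzip2 each version": 179276,
    "gzip -9 each version": 184288,
    "uncompressed": 355548,
}


def saving(smaller, larger):
    """Fraction of space saved by storing `smaller` instead of `larger`."""
    return 1 - sizes[smaller] / sizes[larger]


# Plain RCS files versus compressing every version on its own.
rcs_vs_bzip2 = saving("RCS files", "bzip2 each version")
rcs_vs_gzip = saving("RCS files", "gzip -9 each version")

# How much further gzip -9 shrinks the RCS files themselves.
gz_rcs_vs_rcs = saving("gzip -9 of RCS files", "RCS files")

for label, value in [
    ("RCS vs bzip2 per version", rcs_vs_bzip2),
    ("RCS vs gzip -9 per version", rcs_vs_gzip),
    ("gzip -9 of RCS vs RCS", gz_rcs_vs_rcs),
]:
    print(f"{label}: {value:.1%} saved")
```

The first two figures are where the "about 25%" comes from; the third is the further compressibility of the RCS files.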
However, the further compressibility of RCS files is impressive and
surprising. It suggests (to me) that there are huge amounts of repetition
across files in this repository, and that a compression scheme that built an
initial alphabet by looking across files might do considerably better than
gzip. This can certainly be done by an offline compressor.
If done by an offline compressor, the format can (with care) continue to be
self-describing in the style of normal compressed files. It's a question of
dictionary encoding.
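As a rough illustration of the shared-dictionary idea, here is a minimal Python sketch using zlib's preset dictionary support (the zdict argument of zlib.compressobj and zlib.decompressobj). It is not a measurement of the EROS repository: the header text, file contents, and helper names are all made up, and the "dictionary" is built naively by concatenating other files rather than by any careful cross-file analysis.

```python
import zlib

# Hypothetical boilerplate that many files in a tree might share.
BOILERPLATE = (
    b"/*\n"
    b" * Copyright (C) The Example Project (hypothetical header).\n"
    b" *\n"
    b" * This file is part of an imaginary source tree. Redistribution and\n"
    b" * use in source and binary forms, with or without modification, are\n"
    b" * permitted provided that this notice and paragraph are retained.\n"
    b" */\n"
    b"#include <stddef.h>\n#include <stdint.h>\n#include <string.h>\n"
)


def build_dictionary(samples, max_size=32768):
    """Naive dictionary: the tail of the concatenated sample files.

    Deflate can only look back 32 KiB, so anything beyond that is dropped.
    """
    return b"".join(samples)[-max_size:]


def compress_with_dict(data, zdict):
    """Compress one file standalone, primed with the shared dictionary."""
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob, zdict):
    """Inverse of compress_with_dict; needs the same dictionary bytes."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()


# Other files in the (imaginary) tree, scanned offline to build the dictionary.
other_files = [
    BOILERPLATE + b"int zero(void) { return 0; }\n",
    BOILERPLATE + b"int one(void) { return 1; }\n",
]
zdict = build_dictionary(other_files)

new_file = BOILERPLATE + b"int answer(void) { return 42; }\n"
plain = zlib.compress(new_file, 9)
shared = compress_with_dict(new_file, zdict)

assert decompress_with_dict(shared, zdict) == new_file
print("plain zlib:", len(plain), "bytes; with dictionary:", len(shared), "bytes")
```

The zlib stream format records a checksum of the preset dictionary in its header, so the output stays self-describing in the sense that a reader can tell which dictionary it needs, though it still has to obtain that dictionary from somewhere.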
In the context of wire transmission, this has advantages over deltas. If the
server ships the client a delta but the client doesn't have the base
version, then the whole thing needs to recurse and it soon becomes better to
have shipped the whole thing compressed in the first place. If the
compressed content is as efficient as the delta, it's better to ship
standalone compressed entities.
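The recursion argument can be made concrete with a toy cost model. This is only a sketch under stated assumptions: each version has one delta against a single base, sizes are known in advance, and all names and numbers below are hypothetical.

```python
def bytes_to_ship_deltas(target, client_has, base_of, delta_size, full_size):
    """Bytes sent if the server walks the delta chain back from `target`
    until it reaches a version the client already holds."""
    total = 0
    version = target
    while version not in client_has:
        base = base_of.get(version)
        if base is None:
            # Reached the root of the chain: it has to go out in full.
            total += full_size[version]
            break
        total += delta_size[version]
        version = base
    return total


def choose_transfer(target, client_has, base_of, delta_size, full_size):
    """Pick whichever is cheaper: the delta chain or one standalone copy."""
    delta_cost = bytes_to_ship_deltas(
        target, client_has, base_of, delta_size, full_size
    )
    standalone_cost = full_size[target]
    if delta_cost < standalone_cost:
        return ("deltas", delta_cost)
    return ("standalone", standalone_cost)


# Hypothetical four-version history: each version is a delta on the previous.
base_of = {2: 1, 3: 2, 4: 3}
delta_size = {2: 400, 3: 500, 4: 300}
# Size of each version as a standalone compressed entity.
full_size = {1: 1000, 2: 1000, 3: 1050, 4: 1100}

print(choose_transfer(4, {3}, base_of, delta_size, full_size))
print(choose_transfer(4, {1}, base_of, delta_size, full_size))
print(choose_transfer(4, set(), base_of, delta_size, full_size))
```

With these made-up sizes, a client that is only one version behind is better served by the delta, but a client three versions behind (or with nothing at all) is cheaper to serve with the standalone compressed copy.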
Indeed, one can think of this as metadeltas -- the mutually shared
dictionary is in effect a preconditioned delta system.
Mostly just thinking out loud.
shap
Received on Sat Oct 21 14:36:07 2006