This is all very insightful and informative. For fun, I threw together a
quick script which commits a series of extremely minor changes to an MS
Word file and monitors how the repository size evolves. I then changed
the script to commit not the original Word file but an unzipped and
tarred version, using the following commands:

  mkdir ziptar
  (cd ziptar && unzip ../File.docx && tar cvf ../File.docx.tar ./*)
  rm -rf ziptar

Here is what I get:
# Original file
Revision 1. 174 KB
Revision 2. 231 KB (delta 57K)
Revision 3. 304 KB (delta 73K)
Revision 4. 377 KB (delta 73K)
# With unzipping and tarring applied
Revision 1. 158 KB
Revision 2. 163 KB (delta 5K)
Revision 3. 172 KB (delta 9K)
Revision 4. 177 KB (delta 5K)
So the space savings are significant: over three revisions the
repository grows by 19 KB instead of 203 KB, roughly a 10X reduction.
With larger documents with heavy imagery the ratio would probably
increase. The second half of the zip-tar round-trip (tar back to zip)
would of course need to be implemented in a hook at the right time.
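
For illustration, here is a minimal Python sketch of both halves of the
round-trip. It is not the script I used for the numbers above; the
function names docx_to_tar and tar_to_docx are made up for this example,
and it assumes the .docx is an ordinary zip container with no encrypted
members. It only shows the conversion, not how to wire it into a hook.

```python
import io
import tarfile
import zipfile


def docx_to_tar(docx_path, tar_path):
    """Store every member of the zip container, uncompressed, in a tar file."""
    with zipfile.ZipFile(docx_path) as zf, tarfile.open(tar_path, "w") as tf:
        for info in zf.infolist():  # infolist() keeps the original member order
            if info.is_dir():
                continue
            data = zf.read(info)
            entry = tarfile.TarInfo(name=info.filename)
            entry.size = len(data)
            entry.mtime = 0  # fixed timestamp, so identical input gives an identical tar
            tf.addfile(entry, io.BytesIO(data))


def tar_to_docx(tar_path, docx_path):
    """Rebuild a zip container from the tar file (the second half)."""
    with tarfile.open(tar_path) as tf, zipfile.ZipFile(
        docx_path, "w", compression=zipfile.ZIP_DEFLATED
    ) as zf:
        for entry in tf:
            if entry.isfile():
                # writestr() stamps the member with the current time, so the
                # original zip timestamps and extra fields are not restored.
                zf.writestr(entry.name, tf.extractfile(entry).read())
```

Note that the rebuilt file has the same members with the same contents,
but it is not byte-identical to the original: timestamps, extra fields
and the exact compressed bytes are regenerated.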
But as Vincent noted, this is not really a satisfying solution. Later
versions of the office package may be sensitive to the difference that
the zip-tar-zip round-trip introduces. I could see doing this to
facilitate frequent intermediate commits of large long-lived documents.
But I would probably never feel safe about it if I didn't commit the
original at some intervals as well.
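
One way to get a feel for what the round-trip changes is to compare the
original container with the rebuilt one. The following Python sketch is
only an illustration under my own assumptions: zip_differences is a
made-up helper, and it checks only the fields that Python's zipfile
module exposes (names, order, CRC, size, timestamp, extra field,
compression method, archive comment). Anything else a future word
processor might care about would go unnoticed.

```python
import zipfile


def zip_differences(original_path, rebuilt_path):
    """Return a list of readable differences between two zip containers."""
    diffs = []
    with zipfile.ZipFile(original_path) as a, zipfile.ZipFile(rebuilt_path) as b:
        if a.namelist() != b.namelist():
            diffs.append("member names or order differ")
        if a.comment != b.comment:
            diffs.append("archive comment differs")
        rebuilt = {info.filename: info for info in b.infolist()}
        for info in a.infolist():
            other = rebuilt.get(info.filename)
            if other is None:
                continue  # already covered by the name comparison above
            if info.CRC != other.CRC or info.file_size != other.file_size:
                diffs.append(f"{info.filename}: content differs")
            if info.date_time != other.date_time:
                diffs.append(f"{info.filename}: timestamp differs")
            if info.extra != other.extra:
                diffs.append(f"{info.filename}: extra field differs")
            if info.compress_type != other.compress_type:
                diffs.append(f"{info.filename}: compression method differs")
    return diffs
```

An empty list means only that these particular fields match, not that
the two files are interchangeable for every application.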
In the end, the only satisfying long-term solution would be an efficient
delta-calculation between the two compressed representations, which
would probably require the relevant office packages to use some sort of
rsync-aware (or rsync-compatible) compression.
On 3/6/2013 5:41 AM, Vincent Lefevre wrote:
>
> Moreover even when the users know that the exact bit pattern of the
> compressed file is not important at some time, this may no longer
> be true in the future. For instance, some current word processor may
> ignore the dates in zip files, but future ones may take them into
> account. So, you need to wonder what data are important in a zip
> file, including undocumented ones used by some implementations (as
> the zip format allows extensions). Taking them into account when it
> appears that these data become meaningful is too late, because such
> data would have already been lost in past versions of the Subversion
> repository.
>
Received on 2013-03-06 19:55:43 CET