
Re: Space saving svn enhancements ...

From: Peter Samuelson <peter_at_p12n.org>
Date: Fri, 27 Mar 2009 11:26:28 -0500

[Magnus Torfason]
> Do people think that there may be realistic ways to leverage the
> knowledge that a particular file is actually a zipped/gzipped/tarred
> collection of files to reduce the cost of versioning changes to such files?

I believe it would be nontrivial to store an archive file as anything
other than its own literal bitstream, unless we can assume that only
the integrity of the content files matters and not the integrity of
the archive file itself. Running 'tar' twice on the same directory
will sometimes get you a different ordering of files, because the
order comes from the readdir() function, which returns entries in
whatever order your OS and filesystem choose. All sorts of little
timestamps and permissions and file owners can change when you repack
a tarfile, too.
But ... if you decide that owners, permissions, file order, and the
exact "flavor" of tarfile format are all unimportant, and that the user
will not expect to get back the same tarfile he checked in, then you
could optimize the storage as a directory hierarchy of some sort.
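
To illustrate the trade-off described above, here is a minimal Python sketch using only the standard library. It packs the same two files twice, with a different member order and a different timestamp, and compares the results both as raw bytes and as a name-to-content mapping. The file names, contents, and mtime values are made up for the example; this is not anything Subversion does.

```python
import io
import tarfile


def make_tar(files, mtime):
    """Pack (name, data) pairs into an uncompressed tar held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime  # metadata that changes when you repack
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def contents(archive):
    """Return {member name: bytes}, ignoring member order and metadata."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return {
            member.name: tar.extractfile(member).read()
            for member in tar.getmembers()
            if member.isfile()
        }


# Hypothetical example files; the second archive reverses the order,
# as a different readdir() ordering might, and bumps the timestamp.
files = [("a.txt", b"alpha\n"), ("b.txt", b"beta\n")]
first = make_tar(files, mtime=1238171188)
second = make_tar(list(reversed(files)), mtime=1238171189)

same_bytes = first == second
same_contents = contents(first) == contents(second)
print("identical archive bytes:  ", same_bytes)
print("identical member contents:", same_contents)
```

Storing only what contents() returns is the "directory hierarchy" option: it keeps the files but cannot reproduce the original archive byte for byte.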

Same probably holds true for zipfiles - the exact bitstream will depend
on the deflate engine. Microsoft probably borrowed zlib for this
purpose like most of the rest of us did, but I see no reason to believe
different versions of zlib will produce bit-for-bit identical zipfiles
given the same input. The ZIP format stagnated long enough ago that I
suppose everybody's zipfiles are compatible with each other, so
compatibility is not the concern, but there are lots of reasons to
want your MD5 sums to match.
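
A small Python sketch of the point about deflate engines follows. It uses two zlib compression levels as a stand-in for two different engines or versions, which is an assumption made for illustration; the payload is also made up. Both outputs are valid and decompress to the same input, yet their bytes, and therefore their MD5 sums, differ.

```python
import hashlib
import zlib

# Hypothetical payload; any reasonably compressible input would do.
payload = b"The quick brown fox jumps over the lazy dog.\n" * 200

# Two compression levels stand in for two different deflate engines.
fast = zlib.compress(payload, 1)
best = zlib.compress(payload, 9)

fast_md5 = hashlib.md5(fast).hexdigest()
best_md5 = hashlib.md5(best).hexdigest()

print("level 1:", len(fast), "bytes, md5", fast_md5)
print("level 9:", len(best), "bytes, md5", best_md5)
print("same compressed bytes:", fast == best)
print("same decompressed data:", zlib.decompress(fast) == zlib.decompress(best))
```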

I should mention that the gzip '--rsyncable' flag, which at one time
was specific to Debian but may have proliferated further by now, is a
great way to make sure Subversion can efficiently store small deltas
between revisions of compressed content. --rsyncable produces fully
backward-compatible files, but it does cost maybe 1-2% of file size -
in my opinion, an acceptable cost. I don't suppose Microsoft Office's
copy of zlib (or whatever engine they use) has a similar option? Given
that ZIP uses the same compression algorithm (deflate), it should be
possible.
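
The sketch below shows, in Python, why periodically resetting the compressor helps delta storage. It is a simplification and not gzip's actual --rsyncable algorithm: as far as I know gzip chooses its reset points from the data itself, whereas this example resets after fixed-size blocks (the BLOCK constant is hypothetical), so it only demonstrates the effect for a same-length edit. The reset is done with zlib's Z_FULL_FLUSH, and the sample data is made up.

```python
import zlib

BLOCK = 4096  # hypothetical fixed block size, chosen only for this sketch


def deflate_plain(data):
    """Compress as one raw deflate stream with no reset points."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def deflate_with_resets(data):
    """Compress as one raw deflate stream, resetting state after each block.

    The pieces are returned separately so they can be compared;
    b"".join(pieces) is still a single valid deflate stream.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    pieces = []
    for start in range(0, len(data), BLOCK):
        chunk = data[start:start + BLOCK]
        # Z_FULL_FLUSH byte-aligns the output and forgets earlier input.
        pieces.append(comp.compress(chunk) + comp.flush(zlib.Z_FULL_FLUSH))
    pieces.append(comp.flush())  # terminate the stream
    return pieces


# Made-up sample data, and a copy with one byte changed near the start.
old = b"".join(b"line %d: payload\n" % i for i in range(5000))
edited = bytearray(old)
edited[10] ^= 0x01
new = bytes(edited)

old_pieces = deflate_with_resets(old)
new_pieces = deflate_with_resets(new)
unchanged = sum(a == b for a, b in zip(old_pieces, new_pieces))

plain_size = len(deflate_plain(old))
reset_size = len(b"".join(old_pieces))

print("pieces unchanged after the edit:", unchanged, "of", len(old_pieces))
print("size without resets:", plain_size, "bytes")
print("size with resets:   ", reset_size, "bytes")
```

Only the piece containing the edit should change, so a binary delta between the two compressed revisions stays small. The size difference printed at the end is the price of the resets; with this fixed block size it need not match the 1-2% quoted above for gzip.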

-- 
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/
Received on 2009-03-27 17:26:40 CET

This is an archived mail posted to the Subversion Dev mailing list.
