Re: Space saving svn enhancements ...

From: Hyrum K. Wright <hyrum_wright_at_mail.utexas.edu>
Date: Fri, 27 Mar 2009 11:36:33 -0500

On Mar 27, 2009, at 11:26 AM, Peter Samuelson wrote:

>
> [Magnus Torfason]
>> Do people think that there may be realistic ways to leverage the
>> knowledge that a particular file is actually a zipped/gzipped/tarred
>> collection of files to reduce the cost of versioning changes to
>> such files?
>
> I believe it would be nontrivial to store an archive file as anything
> other than its own literal bitstream, unless we could assume that
> integrity of the archive file itself is not important, but only the
> integrity of the content files is. Indeed, running 'tar' twice on the
> same directory will sometimes get you a different ordering of files,
> because that is determined semi-randomly by the readdir() function,
> driven by your OS filesystem. All sorts of little timestamps and
> permissions and file owners can change when you repack a tarfile, too.
> But ... if you decide that owners, permissions, file order, and the
> exact "flavor" of tarfile format, are all unimportant and the user
> will not expect to get the same tarfile out that he checked in, then
> you could optimize the storage as a directory hierarchy of some sort.
>
> Same probably holds true for zipfiles - the exact bitstream will
> depend on the deflate engine. Microsoft probably borrowed zlib for
> this purpose like most of the rest of us did, but I see no reason to
> believe different versions of zlib will produce bit-for-bit identical
> zipfiles given the same input. The ZIP format stagnated long enough
> ago that I suppose everybody's zipfiles are compatible with each
> other, but there are lots of reasons to want your MD5 sums to match.
>
> I should mention that the gzip '--rsyncable' flag, which at one time
> was specific to Debian but may have proliferated further by now, is a
> great way to make sure Subversion can efficiently store small deltas
> of compressed content. --rsyncable produces fully backward-compatible
> files, but it does cost maybe 1-2% of file size - in my opinion, an
> acceptable cost. I don't suppose Microsoft Office's copy of zlib (or
> whatever engine they use) has a similar option? Given that ZIP uses
> the same compression algorithm, it should be possible.
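
To make the tar ordering point above concrete, here is a minimal Python
sketch. It is not from the thread: the file names and the helpers
make_tar() and contents() are made up for illustration. It builds two
in-memory tarballs holding the same members in opposite order, then
checks that the archive bytes (and so the MD5 sums) differ while the
extracted contents are identical.

```python
import hashlib
import io
import tarfile


def make_tar(members):
    """Build an uncompressed tar archive in memory.

    `members` is a list of (name, bytes) pairs, written in the given order.
    Hypothetical helper, for illustration only.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = 0  # fixed timestamp, so only the ordering differs
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def contents(archive_bytes):
    """Return {member name: member bytes} for an in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}


files = [("a.txt", b"alpha\n"), ("b.txt", b"beta\n")]
tar_one = make_tar(files)
tar_two = make_tar(list(reversed(files)))  # same files, different order

# The archive bitstreams differ, so their checksums differ too.
same_bitstream = hashlib.md5(tar_one).hexdigest() == hashlib.md5(tar_two).hexdigest()
# The files stored inside are nevertheless identical.
same_contents = contents(tar_one) == contents(tar_two)

print("same bitstream:", same_bitstream)
print("same contents:", same_contents)
```

A store that kept only the unpacked contents could treat these two
archives as equal, but it could not hand back either original tarfile
byte for byte.
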

Although I can see the usefulness of it, I don't think this is
Subversion's problem, for the reasons you mention above. Subversion
is about versioning bits. We make special exceptions for presenting
stuff that looks like text, since it is easy to deal with, and is
common for developers, but under the hood, it's still about the bits.

That being said, Subversion could grow the ability for third parties
to handle various kinds of bitstreams in different ways.

-Hyrum

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1445739
Received on 2009-03-27 17:36:51 CET

This is an archived mail posted to the Subversion Dev mailing list.
