
Re: improving subversion treatment of compressed XML/text file formats

From: Benjamin Smith-Mannschott <bsmith.occs_at_gmail.com>
Date: Sat, 25 Oct 2008 17:11:49 +0200

On Oct 24, 2008, at 20:44, Henrik Sundberg wrote:

> On Fri, Oct 24, 2008 at 8:35 PM, Benjamin Smith-Mannschott
> <bsmith.occs_at_gmail.com> wrote:
>>> Just one question about merging with ODT-like documents: not being
>>> able to merge would leave things no worse than binary docs, right?
>>> By this I mean that SVN does not try to merge changes in binaries
>>> currently.
>>
>> ODT-like documents are binary documents. Period.
>
> I disagree. Svn handles diffs in binary files. Compressed binaries are
> different.

On the client side it makes sense to distinguish between "text" and
"binary". The former is composed of lines of mostly ASCII characters
of reasonable length, such that the Unix commands diff and merge (or
their analogs) can be used profitably. The latter is not amenable to
such treatment.

This is the distinction I was making when I called ODT-like documents
"binary".

On the server side the distinction is a different one. The server
doesn't care whether the file is textual in the client sense. The
binary differences (deltas) used on the server side just treat the
files they store as sequences of bytes without structure. (The deltas
used by the server don't resemble the line-based diffs that a command
like svn diff will spit out.)
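
To illustrate the byte-oriented view, here is a minimal Python sketch of
a copy/insert delta over raw bytes. It is NOT Subversion's actual delta
algorithm; it borrows difflib.SequenceMatcher from the standard library
as a stand-in, and the names byte_delta and apply_delta are made up for
this example. The point is only that nothing in it knows about lines.

```python
import hashlib
from difflib import SequenceMatcher

def byte_delta(source: bytes, target: bytes):
    """Describe target as ('copy', offset, length) / ('insert', data) steps."""
    ops = []
    # autojunk=False so frequently occurring byte values are not ignored.
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # reuse bytes from source
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))  # new bytes ('replace'/'insert')
        # a pure 'delete' contributes nothing to the target
    return ops

def apply_delta(source: bytes, ops) -> bytes:
    """Rebuild the target from the source plus the delta steps."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += source[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

# A 1024-byte blob with no line structure, and a copy with two bytes changed.
old_blob = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))
new_blob = old_blob[:500] + b"\xff\xfe" + old_blob[502:]

delta = byte_delta(old_blob, new_blob)
inserted = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(delta), "steps,", inserted, "new bytes to store")
```

A local edit leaves almost everything expressible as "copy from the
previous version", which is what makes a file cheap to store.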

Because of this, the important distinction is between
"delta-friendly" [1] and "delta-hostile" files.

[1] I'm just making these words up. Anyone got a better suggestion?

Delta-friendly files exhibit local changes when they are edited. That
is, change a few words in your MS Word document, and a few opaque
blocks of bytes in the resulting file will be changed relative to the
previous version. Most of the file will remain as it was.

Delta-hostile files are ones where a trivial edit may cause many
changes throughout the file, possibly even to every byte in the file.
Because of the way most compression algorithms work, compressed files
are a good example of delta-hostility. (In fact they are something of
a worst case: not only does the server have to store the full content,
it can't even perform compression on the content because said content
is already compressed.)
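
A small Python experiment can make this visible. The sketch below uses
zlib as a stand-in for whatever compression a document format applies,
and a crude measure (shared prefix plus shared suffix; the helper name
unchanged_bytes is made up here) rather than a real delta algorithm, so
treat the numbers it prints as indicative only.

```python
import zlib

def unchanged_bytes(a: bytes, b: bytes) -> int:
    """Count bytes shared as a common prefix plus a common suffix."""
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return prefix + suffix

# A plain-text "document" and a copy with one word changed near the top.
old_text = "".join(
    f"line {i}: some ordinary document text\n" for i in range(200)
).encode()
new_text = old_text.replace(b"line 3:", b"LINE 3:", 1)

old_packed = zlib.compress(old_text)
new_packed = zlib.compress(new_text)

print("raw:       ", unchanged_bytes(old_text, new_text),
      "of", len(old_text), "bytes unchanged")
print("compressed:", unchanged_bytes(old_packed, new_packed),
      "of", len(old_packed), "bytes unchanged")
```

In the raw text only the edited word differs. How much of the compressed
stream survives depends on the compressor and the data, which is exactly
why compressed formats are unpredictable for the server.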

Compression doesn't have to be involved, however, for a file to be
fairly delta-hostile. Consider the XMI (a sort of XML dialect for UML
diagrams) produced by tools like Rational Modeler. These files tend to
produce large deltas against the previous version because they make
heavy use of seemingly randomly generated IDs, which change every time
the file is saved. The tools also don't seem to write sub-elements out
in a consistent order.
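
The effect of regenerated IDs is easy to mimic. The following Python
sketch is a toy, not real XMI: the element and attribute names are
invented, and save_model simply mints a fresh UUID for every element on
each save, which is an assumption about how such tools behave.

```python
import uuid

def save_model(class_names):
    """Serialize a toy model, minting a fresh ID per element on every save."""
    lines = ["<model>"]
    for name in class_names:
        lines.append(f'  <class id="{uuid.uuid4()}" name="{name}"/>')
    lines.append("</model>")
    return "\n".join(lines) + "\n"

def changed_lines(old: str, new: str) -> int:
    """Count positions where the two saves differ, line by line."""
    return sum(1 for a, b in zip(old.splitlines(), new.splitlines()) if a != b)

model = ["Order", "Customer", "Invoice"]
first_save = save_model(model)
second_save = save_model(model)  # nothing in the model has changed

print(changed_lines(first_save, second_save), "of", len(model),
      "element lines differ between two saves of the same model")
```

Every element line changes even though the model did not, so each
revision looks to the server like a rewrite of most of the file.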

Not all compressed files are necessarily maximally delta-hostile.
Consider a JAR. These are typically composed of many individual class
files. Because of the way the ZIP format (on which JAR is based)
works, each of these files is compressed separately. There's every
reason to expect that those classes which don't change from one
revision of the jar to another will still compress to the same stream
of bytes. Only those class files which have changed will be
(completely) different. There is still ample opportunity in such a
scenario for finding deltas on the server-side.
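
This can be checked with Python's zipfile module. The sketch below
builds two small archives in memory and pulls out the raw compressed
bytes of each member; the member names and contents are placeholders,
the helper names are made up, and a fixed timestamp is used so that the
headers do not differ for unrelated reasons. A real build tool may
reorder entries or change timestamps, which would reduce the benefit.

```python
import io
import struct
import zipfile

FIXED_TIME = (2008, 10, 25, 0, 0, 0)  # constant timestamp for every member

def build_jar(members):
    """Build an in-memory ZIP archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, data)
    return buf.getvalue()

def compressed_streams(archive: bytes):
    """Map each member name to its raw compressed bytes inside the archive."""
    streams = {}
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            # Local file header: 30 fixed bytes, then the name and extra
            # fields, whose lengths are stored at offsets 26 and 28.
            name_len, extra_len = struct.unpack_from(
                "<HH", archive, info.header_offset + 26)
            start = info.header_offset + 30 + name_len + extra_len
            streams[info.filename] = archive[start:start + info.compress_size]
    return streams

old_jar = build_jar({"A.class": b"alpha " * 200, "B.class": b"beta " * 200})
new_jar = build_jar({"A.class": b"alpha " * 200, "B.class": b"gamma " * 200})

old_streams = compressed_streams(old_jar)
new_streams = compressed_streams(new_jar)
for name in old_streams:
    same = old_streams[name] == new_streams[name]
    print(name, "compressed bytes identical" if same else "compressed bytes differ")
```

The unchanged member compresses to the same bytes in both archives, so a
byte-level delta between the two revisions only has to carry the member
that actually changed (plus some header and directory bookkeeping).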

Hope that's clearer
// Ben

Received on 2008-10-25 17:12:15 CEST

This is an archived mail posted to the Subversion Users mailing list.
