
Re: improving subversion treatment of compressed XML/text file formats

From: Benjamin Smith-Mannschott <bsmith.occs_at_gmail.com>
Date: Fri, 24 Oct 2008 20:35:01 +0200

On Oct 24, 2008, at 19:04, David Kaplan wrote:

> On Fri, 2008-10-24 at 18:28 +0200, Benjamin Smith-Mannschott wrote:
> > SVN Repo Space Efficiency when Edited often:

> > format         space efficiency  merge-friendliness
> > =============  ================  ==================
> > plain text     very good         very good
> > html           very good         good
> > flat ODT       good              poor [1]
> > msword doc     acceptable        impossible [2]
> > msword docx    poor              impossible [2]
> > ODT            poor              impossible [2]
> > ---------------------------------------------------
> > [1] This format isn't widely supported (a pity, really).
> > [2] SVN will not and should not attempt to merge these
> > formats as they are not textual. Microsoft-word and
> > OpenOffice do contain features allowing a user to
> > perform merges independently of svn within the tool,
> > it's just that they'd have to do this "by hand" for
> > every merge conflict.
> > ===================================================

> Just one question about merging with ODT-like documents: not being able
> to merge would leave things no worse than binary docs, right? By this I
> mean that SVN does not try to merge changes in binaries currently.

ODT-like documents are binary documents. Period.

> So
> one could still get advantages in space by doing some sort of
> uncompress-diff process as I previously suggested, but one would just be
> left without the possibility of merging. Is this correct?

Correct. The entries (files) in a ZIP can be compressed or not. If
there were a way to convince OpenOffice to just store its XML parts in
the zip without compressing them, this would be enough to allow
subversion's server-side binary diffing algorithm to take
advantage of similarities between file revisions and thus save space.
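
To make the "stored, not compressed" idea concrete, here is a minimal
Python sketch that rewrites an existing ZIP so that every entry is stored
uncompressed. It is a post-processing step run outside the office suite,
not an OpenOffice option; the function name repack_stored is made up for
this example, and whether a given application is happy with the repacked
file is an assumption you would need to check.

```python
import zipfile


def repack_stored(src, dst):
    """Copy every entry of the ZIP at src into a new ZIP at dst, uncompressed.

    src and dst may be file names or binary file-like objects.
    """
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", compression=zipfile.ZIP_STORED) as zout:
        # infolist() yields entries in archive order, so the order is kept.
        for info in zin.infolist():
            data = zin.read(info)  # decompressed bytes of this entry
            zout.writestr(info.filename, data,
                          compress_type=zipfile.ZIP_STORED)
```

The repacked file is larger on disk, but its XML parts now appear as plain
bytes, so two revisions that differ by a small edit also differ by only a
small run of bytes. The file is still binary as far as merging goes.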

On Oct 24, 2008, at 19:04, David Kaplan wrote:
> Also, why is html better at merging than flat ODT? I would imagine that
> any XML-like format would have problems with blind merging.

I'm continually surprised how little research has apparently been done
on the relative mergeability of various formats. The traditional
3-way-merge operates over simple sequences of values (lines of
text), identifying common regions and differences. This works okay
for simple formats but breaks down with formats which have complicated
invariants.
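
The line-based three-way merge described above can be sketched roughly as
follows. This is a simplified illustration built on Python's difflib, not
Subversion's actual merge code; the names merge3 and _match_map are made
up for this example.

```python
import difflib


def _match_map(base, other):
    """Map each base index to the index of its unchanged copy in other."""
    mapping = {}
    matcher = difflib.SequenceMatcher(None, base, other, autojunk=False)
    for i, j, size in matcher.get_matching_blocks():
        for k in range(size):
            mapping[i + k] = j + k
    return mapping


def merge3(base, mine, theirs):
    """Merge two edited line lists against their common ancestor.

    Returns (merged_lines, conflict_count).
    """
    in_mine, in_theirs = _match_map(base, mine), _match_map(base, theirs)
    # Sync points: base lines that survive unchanged on both sides.
    sync = [(i, in_mine[i], in_theirs[i])
            for i in range(len(base)) if i in in_mine and i in in_theirs]
    sync.append((len(base), len(mine), len(theirs)))  # end sentinel

    merged, conflicts = [], 0
    pb = pm = pt = 0  # start of the current chunk in base, mine, theirs
    for sb, sm, st in sync:
        b, m, t = base[pb:sb], mine[pm:sm], theirs[pt:st]
        if m == t:            # both sides agree (or neither changed)
            merged += m
        elif m == b:          # only theirs changed this region
            merged += t
        elif t == b:          # only mine changed this region
            merged += m
        else:                 # both changed it differently
            conflicts += 1
            merged += ["<<<<<<< mine", *m, "=======", *t, ">>>>>>> theirs"]
        if sb < len(base):    # emit the common line itself
            merged.append(base[sb])
        pb, pm, pt = sb + 1, sm + 1, st + 1
    return merged, conflicts
```

A result with zero conflicts only means that the two sides did not edit
the same region of lines. Nothing here checks that the merged lines still
form a valid document, which is exactly where formats with complicated
invariants get into trouble.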

HTML, particularly when hand-edited, is often simple enough in structure to
allow merging. It's also simple enough for a human to fix at the source
level should there be a conflict. Flat ODT? Not so much.

For a more detailed take, let me include an excerpt from a feature
proposal I wrote recently for the svn pre-commit hook script at my
place of work:

B. Smith-Mannschott wrote, in "Subversion Hookscript Requirements
Proposal":

| * Is XML textual or not?
|
| XML presents yet another complication. Strictly speaking, it too should
| be application/xml. A mime-type of text/xml would have to specify its
| encoding with a charset on svn:mime-type and this would be redundant
| for XML, since it carries its encoding information in-band.
|
| However, using application/xml for all XML files would be problematic
| because it would prevent subversion from attempting to merge
| changes. Even more annoyingly, it would make it unnecessarily
| cumbersome to view changes (svn diff).
|
| One size does not fit all. Some XML files will be edited by hand and
| are structured simply enough that line-based merging has a good chance
| of success and line-based diff produces a useful comparison. Examples
| include XML schemas, XML transformations, Maven project object models,
| XHTML pages and many more.
|
| Other XML files are huge dumps of hideously complicated data
| structures with all kinds of complex and undocumented
| constraints. These should be treated as if they were binary files and
| given an application/xml mime type. These kinds of files are rarely
| seen in source form and virtually impossible to diff and merge
| successfully. XMI (Rational Modeler UML) files fall in this category,
| as do the "flat" variants of the OpenOffice.org file formats.
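
As a small illustration of the excerpt's point that XML carries its
encoding in-band, the following Python sketch parses the same document
serialized in two different encodings. The parser reads the encoding from
the XML declaration, so no external charset hint is needed. The helper
name text_of and the sample documents are made up for this example.

```python
import xml.etree.ElementTree as ET


def text_of(xml_bytes):
    """Parse raw XML bytes and return the root element's text."""
    return ET.fromstring(xml_bytes).text


# The same document, serialized with two different byte encodings.
# Each declares its own encoding in the XML declaration.
latin1_doc = ("<?xml version='1.0' encoding='ISO-8859-1'?>"
              "<p>caf\u00e9</p>").encode("iso-8859-1")
utf8_doc = ("<?xml version='1.0' encoding='UTF-8'?>"
            "<p>caf\u00e9</p>").encode("utf-8")

# The bytes differ, but both parse to the same text.
print(latin1_doc != utf8_doc)
print(text_of(latin1_doc) == text_of(utf8_doc))
```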

// Ben

Received on 2008-10-24 20:35:25 CEST

This is an archived mail posted to the Subversion Users mailing list.
