
Re: FSFS format7 and compressed XML bundles

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Thu, 28 Feb 2013 19:53:39 +0000 (GMT)

Ben Reser wrote:

> Speaking with Julian here at ApacheCon he mentioned that gzip has an
> rsyncable option.  Looking into this, it turns out that there is a patch
> applied to Debian's gzip that provides this option.  It resets the
> compression algorithm every 1000 bytes and thus makes blocks that can

Use of such a zip format would be ideal: Subversion's binary-delta would then calculate an excellent delta as long as each inserted chunk is smaller than the delta window size (currently 100 KB; Stefan's proposal is 1 MB).

I'm not sure about the details of how the restartable compression works, but it somehow selects points in the uncompressed data that don't depend on the absolute byte offset from the start of the file, and resets the compression at those points.

As I understand it, only the compressor needs the special logic, and the resulting compressed file is still in the same format and fully compatible with the standard decompression libraries.
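To illustrate the idea, here is a minimal Python sketch of content-defined compressor resets. It is not the algorithm used by the Debian gzip patch. The names `restartable_deflate` and `common_suffix_len` and the `WINDOW` and `MODULUS` values are my own assumptions, and for brevity it recomputes a CRC over a small trailing window at every position instead of maintaining a true rolling checksum. It uses raw DEFLATE (the encoding zip members use) so that no whole-file checksum trailer gets in the way of comparing outputs.

```python
import zlib

WINDOW = 32     # bytes of trailing context examined at each position (assumed)
MODULUS = 1024  # a reset happens roughly once per MODULUS bytes (assumed)


def restartable_deflate(data: bytes) -> bytes:
    """Raw DEFLATE with compressor resets at content-defined positions."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw DEFLATE
    out = []
    start = 0
    for end in range(WINDOW, len(data) + 1):
        # Decide from the last WINDOW bytes only, never from the byte offset.
        if zlib.crc32(data[end - WINDOW:end]) % MODULUS == 0:
            out.append(comp.compress(data[start:end]))
            # Z_FULL_FLUSH byte-aligns the output and discards all history,
            # so what follows does not depend on anything before this point.
            out.append(comp.flush(zlib.Z_FULL_FLUSH))
            start = end
    out.append(comp.compress(data[start:]))
    out.append(comp.flush())
    return b"".join(out)


def common_suffix_len(a: bytes, b: bytes) -> int:
    """Number of identical trailing bytes shared by a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n
```

The output is decoded by the ordinary decompressor, e.g. `zlib.decompress(blob, -15)`, with no special logic. If a few bytes are inserted near the start of the input, the two compressed outputs should differ only up to the first reset point after the insertion, which is what would let a binary delta stay small.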

But unfortunately, although patches for this "restartable" or "rsyncable" mode of compression have been around for years, and it can have a very low overhead, it doesn't yet seem to have been implemented in the common compression libraries (such as zlib), and OpenOffice doesn't offer that mode.

Therefore this is not a practical solution at the moment.

> be saved between revisions of the file.  gzip uses the same DEFLATE
> algorithm that most zip files use, so the same idea could be applied
> to it.  If we want to deal with something like this in Subversion, I
> think we'd deal with it via some sort of plugin for specific file
> types that could convert to the more efficient to deltify encoding
> before committing.  Unfortunately, we don't have any sort of plugin
> type infrastructure for this today.

Yes, a client-side plug-in -- either to Subversion or to OpenOffice -- seems to me the best practical solution.

There exists a plug-in to OpenOffice, "OOoSVN", which, when you want to commit the current version of the doc that you are editing, uncompresses the doc file into a tree of files in its own private svn working copy (that it creates in your home directory) and commits that.  Similarly, to update your doc to an old version, or to retrieve two versions and diff them, it updates this hidden WC and then compresses the files in the WC into a ".odt" or whatever, and lets OpenOffice load or diff that file.
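The unpack and repack steps of that approach can be sketched roughly as below, in Python. This is not OOoSVN's actual code. The function names `unpack_document` and `pack_document` are hypothetical, and the svn commit/update steps are left out. The sketch assumes the unpacked tree may itself be a working copy, so it skips `.svn` directories when repacking, and it writes the `mimetype` entry first and uncompressed, as ODF packages expect.

```python
import zipfile
from pathlib import Path


def unpack_document(doc_path, tree_dir):
    """Expand a zipped document (e.g. .odt) into a tree of plain files."""
    with zipfile.ZipFile(doc_path) as zf:
        zf.extractall(tree_dir)


def pack_document(tree_dir, doc_path):
    """Rebuild a zipped document from a tree of files."""
    root = Path(tree_dir)
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and ".svn" not in p.relative_to(root).parts
    )
    # Stable sort: move "mimetype" to the front, keep the rest in name order.
    files.sort(key=lambda p: p.relative_to(root).as_posix() != "mimetype")
    with zipfile.ZipFile(doc_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            name = p.relative_to(root).as_posix()
            # The mimetype entry is stored without compression.
            method = zipfile.ZIP_STORED if name == "mimetype" else zipfile.ZIP_DEFLATED
            zf.write(p, name, compress_type=method)
```

A commit would then be `unpack_document("report.odt", wc_dir)` followed by an ordinary `svn commit` in `wc_dir`, and a checkout of an old version would be `svn update -r N` followed by `pack_document(wc_dir, "report.odt")`. The file name and `wc_dir` here are placeholders.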

I have tried "OOoSVN" and it works but it is very crude -- the user interface is poor and it is not flexible -- it only supports a local dedicated svn repository, for example.

> Even still there are things that can be done today.  I made a couple
> trivial Microsoft Office Word documents.  One with the characters
> "abc" in it and one with "abcdef" in it.  I saved the files in .docx
> and in the 2003 flat XML format.  The .docx file produced a delta of
> 3262 bytes, the .xml format produced a delta of just 358 bytes.
>
> OpenOffice/LibreOffice support flat versions of their format (e.g.
> .fodt) that are not compressed and can also be more efficiently stored
> in Subversion.  LibreOffice even has a wiki about this:
> https://wiki.documentfoundation.org/Libreoffice_and_subversion

We should talk to the OpenOffice folks and see if we can convince them of the value of using a restartable compression by default, and find out how possible that is.  It would be great if that Wiki page could even say, "We'd like to use restartable compression for this reason but we need the compression library developers to make it available."

But for a practical solution until restartable compression becomes the norm (if it ever does), if you (Magnus) would like to help by designing some kind of solution, that would be great.  Please do keep discussing it here if you have any thoughts in this direction.  FWIW I think it's an important and interesting issue.

- Julian
Received on 2013-02-28 20:54:21 CET

This is an archived mail posted to the Subversion Dev mailing list.
