Re: FSFS format7 and compressed XML bundles

From: Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
Date: Thu, 28 Feb 2013 21:08:58 +0100

On Thu, Feb 28, 2013 at 5:04 PM, Magnus Thor Torfason <zulutime.net_at_gmail.com> wrote:

> Hey all,
>

Sorry that I have to disagree with what most people said.
I guess Mark got closest to what the current intent is.

> I've been following the discussion about FSFS format7, and had a question:
> Is there any chance that the format would improve storage efficiency for
> documents that are stored as compressed (zipped) bundles of XML files and
> other resource files (Read MS Office Documents, but OpenOffice is similar).
>

Yes, exactly that: there is a *chance* that those will be
stored more efficiently. The thing about these formats is that
they are ZIP-compressed file trees, with each file being
something like an embedded picture, the main text body,
the template etc.

ZIP - in contrast to .tar.gz - compresses each of these files
individually and then essentially concatenates them into the
result file. As long as you don't change the template or
any of the existing pictures, for instance, large parts of
the file should remain unchanged. PowerPoint presentations
are probably the ones that benefit most from this scheme.
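
To illustrate the point about ZIP, here is a minimal Python sketch (not
Subversion code). It builds two small archives whose member names and
contents are made up for the example, changes only the text body, and
compares the compressed bytes of each member as they sit in the archive.
Timestamps are fixed so that nothing but the content differs.

```python
import io
import struct
import zipfile


def build_archive(members):
    """Return ZIP bytes for a dict of name -> content, with fixed timestamps."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            info = zipfile.ZipInfo(name, date_time=(2013, 2, 28, 0, 0, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, data)
    return buf.getvalue()


def raw_member_bytes(archive):
    """Map each member name to its compressed bytes as stored in the archive."""
    out = {}
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            # Local file header: 30 fixed bytes, then file name and extra field.
            name_len, extra_len = struct.unpack_from(
                "<HH", archive, info.header_offset + 26)
            start = info.header_offset + 30 + name_len + extra_len
            out[info.filename] = archive[start:start + info.compress_size]
    return out


# Hypothetical document parts: a "picture", a template and the text body.
picture = bytes((i * 37 + i // 251) % 256 for i in range(50000))
template = b"<style name='corporate'/>" * 400

old_doc = {"media/image1.png": picture,
           "template.xml": template,
           "content.xml": b"<p>Hello</p>" * 100}
new_doc = dict(old_doc, **{"content.xml": b"<p>Hello, world</p>" * 100})

old_raw = raw_member_bytes(build_archive(old_doc))
new_raw = raw_member_bytes(build_archive(new_doc))

unchanged = sorted(name for name in old_raw if old_raw[name] == new_raw[name])
print("byte-identical members:", unchanged)
```

Only the edited member's compressed stream differs; the other members are
stored byte for byte as before, just possibly at a different offset.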

Format7 will (hopefully) be able to deal with a few hundred kB of
inserted / removed data and still find all matching regions.
This is exactly what we expect from office files: changes
should affect some of the opaque data blocks but leave
other ones alone.
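
What "finding matching regions despite inserted data" means can be sketched
in a few lines of Python. This is a deliberately simple block-matching
scheme chosen for illustration, not the algorithm format7 uses; the function
names, the 64-byte block size and the test data are all assumptions.

```python
import random


def match_regions(old, new, block=64):
    """Describe `new` as copies from `old` plus literal inserts."""
    # Index every aligned block of the old data by its content.
    index = {}
    for off in range(0, len(old) - block + 1, block):
        index.setdefault(old[off:off + block], off)

    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + block]
        src = index.get(window) if len(window) == block else None
        if src is None:
            # No known block starts here: emit one literal byte, slide on.
            literal.append(new[pos])
            pos += 1
            continue
        if literal:
            ops.append(("insert", bytes(literal)))
            literal.clear()
        # Extend the match forward as far as both sides agree.
        length = block
        while (src + length < len(old) and pos + length < len(new)
               and old[src + length] == new[pos + length]):
            length += 1
        ops.append(("copy", src, length))
        pos += length
    if literal:
        ops.append(("insert", bytes(literal)))
    return ops


def apply_ops(old, ops):
    """Rebuild the new data from the old data and the op list."""
    out = bytearray()
    for op in ops:
        out += old[op[1]:op[1] + op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(20000))
inserted = bytes(rng.getrandbits(8) for _ in range(1000))
new = old[:7000] + inserted + old[7000:]

ops = match_regions(old, new)
literal_bytes = sum(len(op[1]) for op in ops if op[0] == "insert")
print(len(ops), "ops,", literal_bytes, "literal bytes")
```

Because the new data is scanned at every byte offset, the unchanged tail is
found again even though the insertion shifted it. Matches are only extended
forward here, so a few bytes next to the insertion end up as literals.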

I'm finding that making very small changes in big documents (with embedded
> images) results in rapid growth of the repository, since the binary diff
> algorithm seems to not be able to figure out efficient deltas for this type
> of documents, even though analysis of the contents shows that they are
> almost unchanged.
>

In line with what others already said for this: there will be
no format-specific delta algorithms. This would make SVN
susceptible to attacks by manipulated user data (think of
all the security issues that stem from invalid pictures or
zip files).

The furthest that we might go (not planned, though) is to
have a set of alternative generic compression strategies
plus an equally generic way to choose the most suitable
one among them. Again, that is not planned for format7.
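
Purely as an illustration of that idea (and, as said, nothing that is
planned), a generic selection could look like the following Python sketch.
The codec table and the `pack` / `unpack` names are assumptions made for the
example; the only "strategy" is to try each codec and keep the smallest
result.

```python
import bz2
import lzma
import random
import zlib

# Generic codecs only: name -> (compress, decompress). "none" stores as-is.
CODECS = {
    "none": (lambda data: data, lambda data: data),
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def pack(data):
    """Try every codec and return (codec_name, payload) for the smallest one."""
    candidates = ((name, enc(data)) for name, (enc, _) in CODECS.items())
    return min(candidates, key=lambda item: len(item[1]))


def unpack(name, payload):
    """Invert pack() using the recorded codec name."""
    return CODECS[name][1](payload)


text = b"<w:p><w:r><w:t>lorem ipsum</w:t></w:r></w:p>" * 500
rng = random.Random(1)
noise = bytes(rng.getrandbits(8) for _ in range(4096))

for label, data in (("text", text), ("noise", noise)):
    name, payload = pack(data)
    print(label, "->", name, len(data), "->", len(payload))
```

The choice depends only on the resulting size, never on parsing the user
data, which is what keeps such a scheme format-agnostic.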

> This may be outside the scope of format7, but I thought I'd ask the
> question nevertheless.
>

No, it's right on topic. But there will only be general
algorithmic improvements that "happen" to help in your
case.

There is another idea that I had concerning efficient
storage of office files: Templates and corporate ID data
should result in long, identical sub-sections that can be
found in many files. We might be able to identify these
common blocks and store them only once. So far, I
haven't tagged this idea with a target version.
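
The "store them only once" part could be sketched as a content-addressed
block store, shown below in Python. This is an assumption-laden toy, not a
design for FSFS: the `BlockStore` class is hypothetical, and the hard part,
identifying the common blocks in the first place, is taken as given (each
document simply arrives as a list of parts).

```python
import hashlib


class BlockStore:
    """Keep each distinct block once, keyed by its SHA-1 digest."""

    def __init__(self):
        self.blocks = {}

    def put(self, data):
        key = hashlib.sha1(data).hexdigest()
        self.blocks.setdefault(key, data)  # no-op if already stored
        return key

    def add_document(self, parts):
        """Store a document given as a list of blocks; return its key list."""
        return [self.put(part) for part in parts]

    def read_document(self, keys):
        return b"".join(self.blocks[key] for key in keys)

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


# Two made-up documents that share a corporate template block.
template = b"<corporate-template/>" * 1000
body_a = b"Quarterly report"
body_b = b"Meeting minutes"

store = BlockStore()
doc_a = store.add_document([template, body_a])
doc_b = store.add_document([template, body_b])

print("distinct blocks:", len(store.blocks))
print("stored bytes:", store.stored_bytes())
```

Both documents reference the same template key, so the template's bytes are
counted once no matter how many files contain it.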

-- Stefan^2.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download
Received on 2013-02-28 21:09:34 CET

This is an archived mail posted to the Subversion Dev mailing list.