
Re: FSFS format7 and compressed XML bundles

From: Mark Phippard <markphip_at_gmail.com>
Date: Thu, 28 Feb 2013 11:28:46 -0500

On Thu, Feb 28, 2013 at 11:25 AM, Branko Čibej <brane_at_wandisco.com> wrote:

> On 28.02.2013 08:04, Magnus Thor Torfason wrote:
> > Hey all,
> >
> > I've been following the discussion about FSFS format7, and had a
> > question: Is there any chance that the format would improve storage
> > efficiency for documents that are stored as compressed (zipped)
> > bundles of XML files and other resource files (read: MS Office
> > documents, but OpenOffice is similar)?
> >
> > I'm finding that making very small changes in big documents (with
> > embedded images) results in rapid growth of the repository, since the
> > binary diff algorithm does not seem able to find efficient deltas for
> > this type of document, even though analysis of the contents shows
> > that they are almost unchanged.
> >
> > This may be outside the scope of format7, but I thought I'd ask the
> > question nevertheless.
>
> It is outside the scope: format7 is about physical storage layout and
> does not affect the delta/compression layer -- which is the one
> responsible for the effect you're seeing.
>
> We're aware of the issues regarding compressed files, and I expect we
> will eventually come up with a solution. The problem just hasn't seemed
> all that important compared to other things we're trying to solve.
>
> That said, I'm sure we'd welcome any suggestions about how to handle
> such files more efficiently. I can think of a few (e.g., decompress the
> files before deltifying them), but it's always good to hear other points
> of view.
>
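
To illustrate the "decompress before deltifying" idea, here is a minimal
sketch in Python. It is not Subversion code. The helper names
(make_bundle, unpack, changed_span) are hypothetical, and changed_span is
only a crude stand-in for a real delta: it counts the bytes that fall
outside the common prefix and suffix of two versions.

```python
import io
import random
import zipfile


def make_bundle(xml_text: str) -> bytes:
    """Build a one-member zip archive in memory with a fixed timestamp."""
    buf = io.BytesIO()
    info = zipfile.ZipInfo("content.xml", date_time=(2013, 2, 28, 0, 0, 0))
    info.compress_type = zipfile.ZIP_DEFLATED
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr(info, xml_text)
    return buf.getvalue()


def unpack(bundle: bytes) -> bytes:
    """Concatenate the decompressed members of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
        return b"".join(archive.read(n) for n in sorted(archive.namelist()))


def changed_span(old: bytes, new: bytes) -> int:
    """Crude delta proxy: bytes of `new` outside the common prefix/suffix."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return len(new) - prefix - suffix


# Two versions of a document that differ in a single 8-character word.
rng = random.Random(0)
words = ["%08x" % rng.getrandbits(32) for _ in range(2000)]
old_zip = make_bundle("<doc>" + " ".join(words) + "</doc>")
words[10] = "edited!!"
new_zip = make_bundle("<doc>" + " ".join(words) + "</doc>")

zipped_span = changed_span(old_zip, new_zip)               # compressed bytes
raw_span = changed_span(unpack(old_zip), unpack(new_zip))  # decompressed bytes
print("changed span, compressed:  ", zipped_span)
print("changed span, decompressed:", raw_span)
```

In this toy case the decompressed versions differ only around the edited
word, while the compressed archives diverge from the CRC in the header
onwards, which is the effect Magnus describes.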

FWIW, the Branch Readme does imply that its author intends to work on some
things that might have an impact here. Specifically:

TxDelta v2
----------

Version 1 of txdelta turns out to be limited in its effectiveness for
larger files when data gets inserted or removed. For typical office
documents (zip files), deltification often becomes ineffective.

Version 2 shall introduce the following changes:

- increase the delta window from 100kB to 1MB
- use a sliding window instead of a fixed-sized one
- use a slightly more efficient instruction encoding

When introducing it, we will make it an option at the txdelta interfaces
(e.g. a format number). The version will be indicated in the 'SVN\x1' /
'SVN\x2' stream header. While at it, (try to) fix the layering violations
where those prefixes are being read or written.
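
The fixed versus sliding window point can be sketched with a toy model in
Python. This is an assumption-laden illustration, not the txdelta
implementation: matched_bytes is a hypothetical helper, the window and
block sizes are scaled down, and the "sliding" variant simply moves the
source view to wherever the window's content is first found. How the
branch would actually position the view is not stated in the README.

```python
import random


def matched_bytes(source: bytes, target: bytes, window: int, block: int,
                  sliding: bool) -> int:
    """Count target bytes (in whole blocks) found in the source view."""
    # Index of every block-sized substring of the source; used only to
    # position the sliding view, never to match outside the view.
    index = {source[i:i + block]: i
             for i in range(len(source) - block + 1)}
    matched = 0
    view_start = 0
    for t0 in range(0, len(target), window):
        chunk = target[t0:t0 + window]
        if sliding:
            # Slide the view to where this chunk's content appears.
            for off in range(0, len(chunk) - block + 1, block):
                hit = index.get(chunk[off:off + block])
                if hit is not None:
                    view_start = max(0, hit - off)
                    break
        else:
            # Fixed windows: the view sits at the same offset as the chunk.
            view_start = t0
        view = source[view_start:view_start + window]
        seen = {view[i:i + block] for i in range(len(view) - block + 1)}
        matched += sum(block
                       for off in range(0, len(chunk) - block + 1, block)
                       if chunk[off:off + block] in seen)
    return matched


rng = random.Random(1)
source = bytes(rng.getrandbits(8) for _ in range(16384))
inserted = bytes(rng.getrandbits(8) for _ in range(1500))
# Insert more than one window's worth of new data near the start.
target = source[:100] + inserted + source[100:]

fixed_hits = matched_bytes(source, target, 1024, 16, sliding=False)
sliding_hits = matched_bytes(source, target, 1024, 16, sliding=True)
print("matched with fixed windows:  ", fixed_hits)
print("matched with sliding windows:", sliding_hits)
```

Because the insertion is larger than the window, every later fixed window
looks at source data that no longer overlaps its target data, so almost
nothing matches. A larger window alone would only raise the size of the
insertion needed to trigger the same effect.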

Large file storage
------------------

Even most source code repositories contain large, hard to compress,
hard to deltify binaries. Reconstructing their content becomes very I/O
intense and it "dilutes" the data in our pack files. The latter makes
e.g. caching, prefetching and packing less efficient.

Once a representation exceeds a certain configured threshold (16M default),
the fulltext of that item will be stored in a separate file. This will
be marked in the representation_t by an extra flag and future reps will
not be deltified against it. From that location, the data can be forwarded
directly via SendFile and the fulltext caches will not be used for it.

Note that by making the decision contingent upon the size of the deltified
and packed representation, all large data that benefits from these will
still be stored within the rev and pack files.
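
The placement rule described above could look roughly like the following
Python sketch. All names here (Placement, place_representation,
DEFAULT_THRESHOLD) are hypothetical, and reading "16M" as 16 MiB is an
assumption. The only point is that the decision uses the size of the
deltified and packed representation, not the fulltext size.

```python
from dataclasses import dataclass

# "16M default" from the README, read here as 16 MiB (assumption).
DEFAULT_THRESHOLD = 16 * 1024 * 1024


@dataclass(frozen=True)
class Placement:
    external: bool        # fulltext goes to a separate file
    usable_as_base: bool  # future reps may be deltified against it


def place_representation(stored_size: int,
                         threshold: int = DEFAULT_THRESHOLD) -> Placement:
    """Decide placement from the deltified and packed size in bytes."""
    if stored_size > threshold:
        # Large and hard to shrink: store separately, never use as a base.
        return Placement(external=True, usable_as_base=False)
    # Small enough after deltification: keep it in the rev/pack files.
    return Placement(external=False, usable_as_base=True)


MIB = 1024 * 1024
# A 100 MiB file that deltifies down to 2 MiB stays in the pack files.
well_deltified = place_representation(stored_size=2 * MIB)
# A 40 MiB binary that does not shrink is stored externally.
incompressible = place_representation(stored_size=40 * MIB)
print(well_deltified)
print(incompressible)
```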

-- 
Thanks
Mark Phippard
http://markphip.blogspot.com/
Received on 2013-02-28 17:29:19 CET

This is an archived mail posted to the Subversion Dev mailing list.
